For small software systems, perhaps a few thousand lines of code, software structure is largely an esthetic issue. We want to
build small neat structures out of professional pride. We would like readers of our code to find our intents clear and the code
behavior obvious. When software systems grow large, perhaps a million or more lines of source code, its structure becomes much
more important. Developing a large system requires teams of developers working in concert to provide a finished product in a reasonable amount of
time. That means that many people will read each component of our code - all those who's code depends on ours, quality assurance
personnel, and our team leads, charged with meeting schedules with quality products.
Our research group has been, and continues to be interested in finding ways to successfully create and maintain software systems
that are larger than any one person can completely comprehend from its code alone. We need abstractions and tools to help understand and discuss these systems, ways
of building them with reasonable amounts of effort and expense, and ways of packaging, shipping, and maintaining them over their
effective lifetimes. Much of our effort has be directed to understanding and managing software structures and repairing them when
the structures are flawed.
Software structure is expressed in several ways:
-
Placement Tree
Object Oriented Languages like C++, C#, and Java obey a set of inclusion rules that govern how key constructs
are placed within source code. The constructs that will interest us here are: namespaces, interfaces,
classes, structs, enums, functions, controls, and data declarations.
We will see that this placement structure is a tree - each element has only one parent.
Of the constructs cited here, all but data declarations are each endowed with a scope.
Scope is defined by all code enclosed within braces, e.g., "{" and "}". Namespace, interface, class, struct,
and enum scopes are compile-time only constructs.
They add and qualify names in the compiler symbol table and interfaces, classes, and structs determine access rules
and syntax for functions and data contained within their scope, and enums simply define a set of named integer constants.
Function and control scopes have both compile-time and run-time effects. At compile time they serve to provide names and
define syntax. At run time when a function or control scope is entered by the thread of execution it is provided a stack
frame that serves as scratch-pad memory, valid only for the time that execution resides within that scope. Local data can
be declared withing these scopes and if the data is an instance of a C++ class or struct its destructor is invoked when execution
leaves the scope.
The placement tree is a tree of scopes with the global namespace as its root. The rules that govern contruction of the
scope tree are as follows:
-
Namespaces
Namespaces can be composed (we sometimes say nested) and/or sequenced but must be declared at the global level.
That is, they cannot be declared inside any non-namespace scope. The outer namespace is the global namespace.
It has no name, is not declared, and includes everything not enclosed in a declared namespace.
Namespaces can include:
- other namespaces, interfaces, classes, structs, enums, functions, and data
-
Interfaces, Classes, and Structs
Classes and structs can be declared and may be defined any place a declaration is accepted by the programming language.
They can include:
- other interfaces, classes and structs, enums, functions, and data declarations
-
Enums
Enums can include only named constant integers that may have an associated value specifier.
-
Functions
Functions can be declared only within namespaces and classes. Functions are the only place that program behavior
is defined.
They can include:
- Interfaces, classes, structs, controls, data declarations and data definitions.
-
Data declarations and definitions
Data declarations can not include any of the other constructs discussed here. They are always a leaf of the placement
tree.
Code restructuring entails making behavior preserving modifications to the placement tree. The goal of restructuring is
to make the code more understandable, maintainable, and to reduce undesireable dependency effects.
These operations are exclusively:
-
Extracting new Functions
Adding a new function as a child of a namespace or class and moving some branch of the tree inside the new function and
adding a function invocation at the original site of the moved branch.
-
Extracting new Classes and Structs
Adding a new class as a child of a namespace or function and moving functions and data declarations within its scope
by making them children of the new class or struct.
-
Adding new Interfaces
Adding a new interface has a small syntactical effect, but a major effect on the code's dependency structure, discussed
next.
-
Dependency Graphs
Unlike placement, dependency is a graph relationship. Any software element may have more than one other element that depends
upon it, e.g., a parent in the software's dependency graph. There are several different graph representations useful for understanding
software structure:
-
Function Call Dependency Graph
One function depends on another if, and only if, it invokes the other. The graph of function call dependencies is often
referred to as a structure chart. That name is a hold-over from the days of structured programming which focused on partitioning
software into functions. Today, for understanding, we tend to focus more on classes and packages rather than functions. That is one of our
abstraction tools that help us understand large systems. But even now, when we need to understand the details of how a package
works, function call dependency graphs can be very useful.
-
Package Dependency Graph
Package A depends on package B if:
-
Package B declares and/or defines a namespace, interface, class, struct, enum, function or datum cited in A, e.g., any of
the constructs that define package B's placement tree.
The citation may be to create an instance, include in a derivation chain, invoke, or include in an expression.
Package dependencies determine both build and test strategies. For example, all mutually dependent packages must be
considered a single unit of testing, for a change to any one package in that set may break any other element of the set.
In the language of graph theory, these sets of mutually dependent packages are called strong components.
-
Data Use by Line Number, a Bipartite Graph
When restructuring code, one of the essential things we need to track are the locations of all references to data.
One way of representing that is with a bipartite graph. A bipartite graph is a graph with vertices divided onto two
sets with relationships only between an element of one set with an element of the other set.
The bipartite graph we are interested in is one in which one set of vertices represents each datum (either primitive or
user defined) and the other set of verticies represents the line numbers where that datum is cited. So, for each datum,
the graph contains relationships that define all the line numbers where that datum was used.
-
Data Composition Dependencies
One member of our group, Mehmet Kaya, is currently working on the code restructuring problem, and believes that data
dependencies will be important for extracting new classes from existing code. These dependencies arise from composition
operations. A new value for an existing datum is defined by an expression that uses values of one or more other data. So
there exists a potentially interesting data composition graph that may find use in defining new cohesive classes in code
that could benefit from additional structure.
Dependency graphs can be restructured too by inserting interfaces to break implementation dependencies, by shifting some package
contents to another package for reasons of size, complexity, or cohesion, and by extracting code from a package to form one or
more new packages.
-
Visualization
Each package in a system defines one and only one placement tree. If package sizes are reasonable, say at most a few hundred
lines of code, the placement tree of any of them will be of reasonably modest size, visualizable by fairly routine means, say a
well laid-out tree in a single view.
Dependency graphs are an entirely different matter. A large system may have thousands of packages with correspondingly large
dependency graph. And function calling sequences are not constrained by package boundaries so may also become quite large. How
can we represent these large structures in a way that is meaningful to a human observer? One of our group, Mubarek Mohammed,
is beginning to think about some of these issues.
We need tools to discover and manipulate the relationships we've discussed above. But, without some kind of effective visualization
process a user will be overwhelmed with data. Visualization procedures are needed to render these large collections of data into
meaningful information.
-
Leveling
One idea is to define levels withing the dependency relationships and, at first, show only the top few levels.
When the user identifies a region of interest, that vertex becomes a new visualization root and the next
few levels are displayed. This gives us a way to zoom into and explore regions of interest.
-
Semantic Tagging
Another idea is to use some form of semantic tagging to allow a user to explore and mark regions of interest. Namespaces give us
a means of defining a semantic group, but we may also need some, perhaps attribute marking scheme, encoded into the dependency
relationships, to identify a particular interesting point.
The format of this page is an experiment to see if some variations in page style work for this site. I think the jury is still
out on this question.