Ric Holt, Department of Computer Science, University of
Waterloo
27 Feb 97, Updated March 97, May 97, Nov 98, Jan 2002, May 2002, July 2002
holt@uwaterloo.ca
TA, the Tuple-Attribute language, is
intended to allow convenient recording of information about certain types of
graphs. This information includes (1) nodes and edges in the graph, and (2)
attributes of these nodes and edges. The information in a TA
"program" can be interpreted to be a special kind of data base, a graph
data base (GDB), where the relations (corresponding to the edges) are all
binary. This is the class of graphs formalized as a calculus of relations by
Tarski [1955]; see also [Holt 1997].
The impetus for designing TA comes from
the need for a convenient format for recording, manipulating and diagramming
information that represents the structure of large computer programs. However,
the design of TA is not restricted to this application. While TA is convenient
for recording how to draw graphs, by encoding positions and sizes as
attributes, TA is not uniquely a "drawing language".
TA can serve as a "data
interchange" language, used for exchanging information among a set of
software tools. For example, parsers can extract "facts" from source
language programs and can represent these in TA. These facts in turn can be
manipulated to create new "views", also expressed in TA, which in
turn can be displayed on a screen by a tool called a renderer. The Grok
interpreter [Holt 97] is an example of a tool which manipulates these facts.
The TA language is divided into two
sub-languages, namely, the Tuple and the Attribute sub-languages. TA is also a
two level language, one level for recording "facts" (an actual graph
with its attributes) and another level for recording the general scheme for the
graph (the classes of edges and of attributes). Within the Tuple parts, facts are represented in RSF notation.
The next two sections introduce the Tuple
and Attribute sub-languages, at the "fact" level.
Here is a part of a "TA
program", written in the Tuple subset of the TA language, which represents
nodes P, Q and V, with node P having a Call edge to Q, and node Q having a Ref
edge to V:
// Two edges in the Tuple language.
Call P Q
Ref Q V
In each triple, such as "Call P Q", we can think of P as the subject, Call as the verb and Q as the object; this could mean procedure P calls procedure Q. (This syntax for triples was invented in the Rigi Project [Muller 93], where it is called RSF, for Rigi Standard Form. We have extended this notation in various ways, as described in this note. The commenting convention, with comments starting with //, is borrowed from C++.)
The graph defined by this Tuple program
can be drawn as
P -------> Q -------> V
Call Ref
P, Q and V are called entities or nodes. The term "entity" is most natural when considering that the information in the Tuple program constitutes a database. The term "node" is most natural when we alternately consider that the information in the Tuple program represents a graph.
The two tuples represent the edges (P, Q)
and (Q, V) in the relations Call and Ref, respectively. The term
"tuple" or alternately "fact" is natural to use when we are
thinking about a database. The term "edge" is natural when we are
thinking about a graph.
We could also have other nodes, such as
main.c or "New York" and other relations, such as Declare or
is-owner-of. The syntax of TA allows entities and relations to be named by (to
have identifiers as) any string that contains no white space (or is quoted).
TA's concept of a graph is sometimes
called a multigraph or a typed graph or a colored graph, which means that there
can be more than one type (color) of edge. Between two nodes, there can be more
than one edge, given that the edges are of different types. In our example,
there are two types of edges, named Call and Ref. We will not use the term
"type", but will instead say that there are two
"relations", Call and Ref. In this example Call contains one edge (P,
Q) and Ref contains one edge (Q, V).
Aside. Besides interpreting a Tuple program as a graph or a
database, we can also interpret it is a Prolog program; Ullman [88] gives the
interpretation of data bases as Prolog (really, as DataLog) programs. For
example, the tuple "Call P Q" would be written in Prolog as Call(P,
Q), where it would mean that predicate Call is true for arguments P and Q. In
the Prolog interpretation, the term "fact" is used instead of
"tuple" or "edge".
To summarize what has been said so far,
the Tuple language has a syntax, namely, it is written has a sequence of
triples, and it also has a semantics. The semantics is that each triple is
interpreted as an edge in a typed graph. Nodes in this graph are distinct if they
have distinct names. (Later we will explain that the nodes can be divided into
"entity classes", analogous to the way in which edges are divided
into separate relations.) In the rest of this note, we will use the terms
"node" and "entity" interchangeably and the terms
"tuple", "triple", and "edge" interchangeably.
A number of "extractors" have
been written which parse source code in a particular language, and translate
the relations in the code into TA. In the PBS project, at theUniversities of
Toronto and Waterloo, there is a CFX parser, based on GCC, which inputs C and
outputs the RSF subset of TA. More recently, Malton and Dean modified GCC to
produce an open source extractor called CPPX, which inputs C++ code and outputs
the corresponding AST (abstract syntax tree) either in TA or in the GXL version
of XML [Malton 2001].
Grok [Holt 97] is a calculator that reads
and writes TA and RSF and manipulates the graph information. For example, using
Grok one can carry out operations such as finding the union of two relations
(by combining their tuples), finding composition of two relations (this
produces a new relation and is analogous to a "join"), finding the
transitive closure of a relation, etc. The operations in Grok are analogous to
the SQL operations in a relational database.
Each node and edge can have attributes.
For example, here are color attributes for the nodes and edges in the previous
example, written in the Attribute sub-language:
// Colors for nodes P, Q and V
P { color = red }
Q { color = yellow }
V { color = green }
// Colors for the two edges
(Call P Q) { color = black }
(Ref Q V) { color = gray }
Here are further attributes for the three
nodes that record positions and sizes for diagramming the graph:
// (x,y) positions and (width,height) sizes for nodes P, Q and V
P { x = 50 y = 100 width = 70 height = 30 }
Q { x = 150 y = 100 width = 70 height = 30 }
V { x = 250 y = 100 width = 70 height = 30 }
Here is another attribute giving the
description of node P:
// Description for node P
P { description = "This is a procedure named P" }
As you can see from these examples, an
attribute for a node or edge is given by writing the name of the node, such as
P, or writing the edge in the form (Call P Q), then a left brace {, then a list
of items of the form "name-of-attribute = attribute-value", all
followed by a right brace }.
The syntax of TA is free form, allowing
comments to appear anywhere, blank lines to appear anywhere, and
"tokens", such as names and special characters such as { and = to be
separated by white space, such as blanks and ends of lines. For example, we
could have written the (x,y) and size attributes for P this way:
P { x = 50 y = 100
width = 70
height = 30 }
The examples of attributes that we have
given in this section are suitable for recording how to draw our example graph.
For example, node P would be drawn in color red, with its top left corner at
(x,y) location (50, 100), with a width and height of 70 and 30 pixels
respectively. This interpretation is not inherent in the TA language, but
rather is a convention used by certain renderers (automatic picture drawers)
that use TA to give the format for their input.
The TA language itself does not place any
particular interpretation upon the attributes. Rather, like a database, TA
simply serves as a means for recording these values. However, a set of tools,
such as the Landscape tools [Mancoridis 95] for drawing software architecture
diagrams, uses the convention we have illustrated here for recording color, x,
y, width and height information. At the Universities of Toronto and Waterloo, a
set of software tools has been produced which reads and manipulates TA programs
that use this convention. For example, there is a layout program, LSLAYOUT that
reads in tuples and automatically produces the x and y information to describe
a diagram in which there are minimal crossings of edges. (These programs
require that colors must be recorded as RGB values such as (.5 .2 .8) instead
of names such as "red".)
We use attributes to attach values to
entities, such as red to P, when the value can be thought of as being
"local" to the entity. We think of P's set of attributes as a record
that "belongs" to P or is contained in P.
Up to this point we have introduced the
"fact" level of the TA language, which is used to describe a graph
and its attributes. We will now introduce the "meta" or
"scheme" level of TA, which is used to describe the general
"shape" of a graph. The scheme level of TA uses essentially the same
syntax as the fact level, but now the notation is used to record higher level
information; in particular, the scheme level describes what is commonly known
as an E/R Diagram [Chen 77]. (Note: a "scheme" is commonly called a
"schema". In this section we
will use the term “scheme” because that term is part of the TA syntax.)
In the example we have given, there were
two types (classes) of edges, namely, Call and Ref edges. There were also two
types (classes) of nodes, namely, procedure nodes (P and Q) and variable nodes
(V).
We can record these conventions as
follows
Call Proc Proc
Ref Proc Var
What these two lines mean is that
procedures Call procedures and procedures Ref (reference) variables. This
implies that there are Call and Ref relations and that there are Proc and Var
entity classes. (Note that an "edge class" is called a
"relation".) In other words, an actual graph can have Call edges from
Procs to Procs as well as Ref edges from Procs to Vars, but no other edges.
When combining fact level and scheme
level notation in the same file, we separate the file into one of four possible
sections, with these four headers:
SCHEME TUPLE : // Gives E/R description of allowed edges
SCHEME ATTRIBUTE : // Gives names of allowed attributes
FACT TUPLE : // Gives edges in actual graph
FACT ATTRIBUTE : // Gives attributes in actual graph
In the usual case, a scheme, described by
its scheme tuples and its scheme attributes, would be used for more than one
actual (fact level) graph.
In our example, we would add the
following to our "fact" level TA program (to a FACT TUPLE section):
$INSTANCE P Proc
$INSTANCE Q Proc
$INSTANCE V Var
This means that P and Q are in the Proc
class of entities and V is in the Var class of entities. We can consider that
these are "declarations" for P, Q and V.
We have given an example of the Tuple
sub-language at the scheme level (as in "Call Proc Proc"). We will
now give an example of the Attribute sub-language at the scheme level. In our
example, procedures (Proc's) can have six attributes: x, y, width, height,
color and description. We can record this scheme information as follows:
Proc { x y width height
color = red
description }
We have "declared" that a Proc
entity can have these six attributes. We have also "declared" that
there is a default value of the color attribute, which is "red". This
means that if any Proc entity is not explicitly given a color attribute set, it
is taken to be red.
Rules for attributes values. First,
consider attributes given in the FACT section. Such an attribute must have an
explicit value, so “color = blue” is acceptable, but simply “color” is not
acceptable. Such a fact attribute can
be given only if there is a corresponding attribute in the SCHEME section. Second, consider attributes written
in the SCHEME section. It is optional
whether a value is given for such an attribute, so in the scheme, either
“color” or “color = red” is acceptable. If the value is not given, the
attribute signifies that a value can be given (but is not required to be given)
for corresponding attributes in the FACT section. If the scheme gives a value, as in “color = red”, this provides
the default value for corresponding fact attributes. This default can be overridden by explicitly providing a value in
the FACT section.
Aside. E/R Diagrams are richer than TA's schemes in that E/R
Diagrams support n-ary relations while TA supports only binary relations such
as Call and Ref.
Among entity classes (and relations)
there may be many common characteristics. In our example, all entities have the
same attributes, whether they were Proc's or Var's. Similarly, the two edge
classes (relations) both have a color attribute.
We can use inheritance to avoid
re-writing such common attributes, as is done here at the scheme level:
$INHERIT Proc ProgItem
$INHERIT Var ProgItem
This says that a Proc "is a" ProgItem, as well, Var "is a" ProgItem. We declare the common attributes of program items (ProgItem's) this way:
ProgItem { x y width height
color
description }
Proc { color = red
numberOfCalls = 0 }
Var { Color = green }
These attribute declarations allow
procedure and variable entities to have the six attributes of program items. As
well procedures have a default color of red and a numberOfCalls attribute that
is initially 0. Variables have a default color of green.
There is a predefined entity class called
$ENTITY which is the inheritance root for entity classes; in other words, all
entity classes implicitly inherit from $ENTITY. Instead of using ProgItem we
could have used $ENTITY with the same effect, as in the following way:
$ENTITY { x y width height
color
description }
Proc { color = red
numberOfCalls = 0 }
Var { Color = green }
Since every entity class, such as Proc and Var, inherits from $ENTITY, we do not need to add lines such as "$INHERIT Proc $ENTITY".
There is a predefined relation called
$RELATION that is the inheritance root for relations; in other words, all
relations implicitly inherit from $RELATION. We can specify that all relations
have color and weight attributes as follows:
($RELATION) { color weight }
(Call) { color = black }
We have also specified that the default color for Call edges is black.
In a typed graph, there is at most one
edge of a given type between two nodes. If we wish to represent more than one
edge of a given type, e.g., more than one call from P to Q, we can use an
attribute such as "weight" to record this information. For example,
the following can record the fact that there are three calls from procedure P
to procedure Q:
(Call P Q) { weight = 3 }
When one entity class inherits from
another, this has two affects. First, the inheriting class gets all the
attributes of the inherited-from class. Second, any connectivity allowed for
the inherited-from entity class is also allowed by the inheriting class. For
example, this line
partOf progItem progItem
specifies that any progItem (any Proc or Var) can be "part of" any other progItem.
As another example,
$RELATION $ENTITY $ENTITY
allows any entity (every entity descends from $ENTITY) to be connected by any relation (every relation descends from $RELATION) to any other entity.
In TA, inheritance is a notational
convenience, but it adds no new power beyond what was in TA without inheritance.
It simply allows us to write less by factoring common scheme information into
an inheritance ancestor. It is always possible to write out the connectivity
and attributes explicitly rather than using inheritance and get the same
effect.
The only possible incompatibility from
combining attributes of two inherited attributes would be if two attributes had
the same name and one named a nested set of attributes and the other named a
scalar attribute; this problem and its solution is discussed in the section on
attribute nesting and attribute lists.
Multiple inheritance is allowed. For
example, a Proc could inherit from both a Scope and a Unit. Multiple
inheritance combines the attributes from its various inherited-from classes.
Two inheritances of the same attribute, say color, are taken to be a single
attribute of that name. The only possible incompatibility when combining
attributes is the same as the problem previously described, in which an item's
attributes are given several times.
There are two languages that make up TA,
the Tuple and the Attribute sub-languages. As well, there are two levels of
each of these, the fact level and the scheme level. We illustrate this idea in
this diagram.
Tuple Attribute
+---------------+----------------+
| | |
Scheme | Allowed | Allowed |
| edges | attributes |
| | |
+---------------+----------------+
| | |
Fact | Actual | Actual |
| edges | attributes |
| | |
+---------------+----------------+
The syntax used for scheme tuples and for
fact tuples is identical, and the syntax used for scheme attributes and for
fact attributes is essentially identical. At the scheme level we
"declare" each entity class and each relation as well as the
attributes allowed for each of these. At the fact level, we
"populate" a graph, by giving the graph's nodes, edges and
attributes.
The purpose of a scheme is to capture
certain invariant structural properties of a fact-level graph (or set of
graphs). For example, it is handy to state, in a single place, the kinds of
nodes and edges in a graph, its various attributes and the allowed connectivity
among nodes. In turn, it is helpful to be able to test to see if a (fact level)
graph actually conforms to its scheme. If not, presumably there is something
wrong with the graph (or possibly with the scheme). The scheme provides a certain amount of redundancy that can help us
find errors in particular graphs. Put another way, a scheme encodes a set of
constraints that we expect our graph(s) to satisfy. Yet another way to put this
is that a scheme provides a set of declarations for a graph, and the graph
itself makes assignments (creates structure) to populate the declarations.
If the fact level graph F corresponds to
a particular scheme S, we say that F "conforms" to S (or satisfies S,
or is legal according to S). Sometimes there is a commonly used scheme S, in
which case we say simply F is conformant, or legal, without mentioning S.
In a standard relational database (an
RDB), there is a scheme and the corresponding (instance) database contains
tuples that corresponding to the scheme. In the usual implementations of RDBs,
it is not possible to add a tuple to the database unless the tuple corresponds
to the scheme. This in turn implies that the database can never fail to conform
to its scheme.
Things are not this simple in TA. In a TA
program, the programmer can write anything he or she feels like. If he violates
the TA's syntax, then technically, the program has no meaning at all. If the
program satisfies TA's syntax, then the program may or may not be conformant.
If it is, this situation is like that in an RDB, namely, the fact level graph
actually satisfies its scheme. If not, we can still consider that the fact
level graph exists, as does the scheme, even though the fact level graph
"breaks the rules" of its scheme. We can repair (or write a program
to repair) a program so it satisfies its scheme, by deleting violating
constructs.
Note: There is a "schema
checker", called chkschema, written in Grok, which checks to see if the
fact part of a TA program conforms to its schema.
A scheme defines those relations, nodes
and attributes that are allowed in a corresponding fact graph. A scheme is
usually given explicitly in terms of the TA language. However, in principle, a
scheme is optional in that the Universal Scheme can be used. The Universal
Scheme assumes:
(a) Any node can be attached to any other
node. This is equivalent to this tuple in an explicit scheme:
$RELATION $ENTITY $ENTITY
(b) Any name of a node that appears in a
fact tuple, such as A and B in "Rel A B", is implicitly considered to
be an instance of $ENTITY, as if the following occurred explicitly as a tuple
in a scheme:
$INSTANCE A $ENTITY
$INSTANCE B $ENTITY
(c) Any attribute is allowed for any node
or edge or relation. This is as if the corresponding attribute were given
explicitly in a scheme attribute.
The value of an attribute can either be a
string, such as green in
P { color = green }
or a value list as in
P { color = ( 0.5 0.2 0.4 ) }
A value list is a sequence of zero or
more values (strings) surrounded by parentheses ( … ). Note that lists of lists
are allowed.
The original version of TA specified that
attributes could be nested, but this feature has been eliminated from TA.
The grammar for TA is given below. The
notation {...} means zero or more repetitions of the enclosed items.
TALanguage ::= {section}
section ::=
SCHEME TUPLE : tupleLanguage
| SCHEME ATTRIBUTE : attributeLanguage
| FACT TUPLE : tupleLanguage
| FACT ATTRIBUTE : attributeLanguage
tupleLanguage ::= { stringToken stringToken stringToken }
attributeLanguage ::=
{ itemId "{" {attributeSetting} "}" }
attributeSetting ::=
attributeId
| attributeId "=" attributeValue
attributeValue ::=
stringToken
| "(" {stringToken} ")"
itemId ::= stringToken // Fact entity or, in scheme, entity class
| "(" stringToken ")" // Relation (edge class)
| "(" stringToken stringToken stringToken ")" // Actual edge
Note: Nested attributes have been deleted
from TA.
These stringTokens have predefined
meanings: $ENTITY, $RELATION, $INSTANCE, and $INHERIT. ($NOTE no longer has any
meaning.)
In TA, string tokens (stringTokens) may
be quoted, or not, depending on their contents. For example, in the following:
P { color = red }
the string token with value red is written without quotes. This could as well be written as
P { color = "red" }
with the same meaning. This can also be written as:
P { color = 'red' }
As these examples illustrate, single or double quotation marks can be used to enclose the contents of string tokens.
In string tokens, certain characters,
such as newline characters, must be escaped, using a backslash character (\).
Here is the list of these characters:
CHARACTER ESCAPE CONVENTION
newline (10) \n
tab (9) \t
formfeed (12) \f
return (13) \r
eject (27) \e
delete (127) \d
backslash (92) \\
single quote (39) \'
(escaping not necessary in double quoted strings)
double quote (34) \"
(escaping not necessary in single quoted strings)
Single quotes (') do not need to be
escaped within strings surrounded by double quotes (") and vice versa.
In TA, string tokens containing the
following must be quoted:
Blank character
Any of the above escape characters
( Left parenthesis
) Right parenthesis
{ Left brace
} Right brace
= Equal sign
' Single quote in a double quoted string
" Double quote in a single quoted string
String tokens in TA must contain only
printable characters (and the above escaped characters). As a result, TA token
strings cannot contain all the possible 8-bit patterns of bytes.
String tokens may have their lengths
limited by some tools; for example, relation names, entity class names, entity
names and attributes names may be limited to 255 characters. (Grok has this
restriction.) Tools that use or accept attributes values are expected to handle
attribute string values that are very long, e.g., thousands of characters long.
Some tools, notably LSEDIT, use the
following convention. Several fact assignments are allowed to a given
attribute, with the final assignment being the effective one (previous ones
being effectively ignored). Note that there is no way in TA to add to an
already existing list (although of course a particular tool could read a TA
program and modify it such that new items are effectively added to a list).
There is a "include" directive,
so you can build up a TA program out of a set of files. The format of this
directive is:
INCLUDE nameOfIncludedFileInQuotes :
LSEDIT makes effective use of
"includes" to separate out various parts of TA, such as the schema,
layout information and descriptions, which are attributes.
The LSEDIT program reads TA and draws the corresponding graph. LSEDIT requires certain ordering in the TA. In brief, it requires DBU (declaration before use). More specifically, it requires the following ordering:
· (1) scheme before facts,
· (2) inheritance before schema relations,
· (3) inheritance of entities before their scheme attributes,
· (4) instance information before using a fact entity.
[Chen 77] The Entity-Relationship
Approach to Database Design, by Peter Chen, QED Information Sciences Inc., 170
Linden St., Wellesley, Mass., 1977, 1991
[Holt 1998] Structural Manipulations
of Software Architecture using Tarski Relational Algebra, R. C. Holt, WCRE
'98: Working Conference on Reverse Engineering, Honolulu, Oct 1998.
[Malton 01] Union
Schemas as the Basis for a C++ Extractor, Thomas Dean, Andrew Malton,
Richard Holt, WCRE 2001: Working Conference on Reverse Engineering, Stuttgart,
Germany, Oct 2-5, 2001.
[Mancoridis 95] Extending Programming
Environments to Support Architectural Design, S. Mancoridis and R. C. Holt,
CASE '95: Workshop on Computer-Aided Software Engineering, Toronto, Canada,
July, 1995.
[Muller 93] A Reverse Engineering
Approach to Subsystem Identification, by Hausi A. Muller, O.A. Mehmet, S.R.
Tilley, and J.S. Uhl, Software Maintenance and Practice, Vol 5, 181-204, 1993.
[Tarski 55] On the Calculus of Relations,
by Tarski, A., J. Symb. Log. 6, 3, 1941, 73-89.
[Ullman 88] Principles of Database and
Knowledge-Base Systems (Volume I), by J. D. Ullman, Computer Science Press, New
York, New York, 1988.