Ric Holt, 5 May 2002
Abstract. This note introduces Grok, a notation for manipulating binary relations. Grok has an interpreter, which can be considered to be a relational calculator. This interpreter has been used extensively for analyzing factbases produced by parsers that extract information from source programs.
Background and Overview. The initial version of Grok was created by the author in 1995. It has evolved to become a language for manipulating factbases. Grok operates at the level of a relational database, in that operators generally apply across entire relations, and not just to single entities. The Grok interpreter has been optimized to handle large factbases (up to several hundred thousands of facts). It keeps all of its data structures in memory. It is written in the Turing language.
The term grok means to understand, which is appropriate in that the Grok language is used to help understand large-scale software. This term was coined in the book Stranger in a Strange Land, by Robert A. Heinlein.
This report consists of five parts. Part 1 introduces Grok’s set and relation operators using an example involving Garfield the cat. These operators are summarized in Part 2.
Part 3 shows how Grok can be used to interactively query factbases. It uses an example to introduce schemas (also called schemes) that characterize the form of factbases. The reader may want to consult the related report on TA (the Tuple Attribute languages) when reading this part. Part 4 shows how scripts written in Grok can be used to query and transform factbases. Finally, Part 5 summarizes the features of Grok.
Part 1. Introduction to Grok’s Operators
Part 2. Summary of Grok’s Operators
Part 3. Grok Queries, Factbases and Schemas
Part 4. Writing Grok Scripts
Part 5. Summary of Grok Features
We will introduce Grok’s set and relation operators using an example in which there are two cats (Garfield and Fluffy), two mice (Mickey and Nancy) and two pieces of cheese (Roquefort and Swiss). We will begin by defining sets for these cats, mice and cheese.
Set constructors. In the Grok language we write
cat := {“Garfield”, “Fluffy”}
to construct the set variable named cat containing the strings Garfield and Fluffy. Similarly, we can create set variables mouse and cheese:
mouse := {“Mickey”, “Nancy”}
cheese := {“Roquefort”, “Swiss”}
Set union, intersection and subtraction. We can combine sets. For example, the Grok statement
animal := cat + mouse
creates the set called animal which is the union of the cat and mouse sets. As a result, animal contains the four strings Garfield, Fluffy, Mickey and Nancy. Similarly, we can take the union of the mouse and cheese sets to create the set of items that are food:
food := mouse + cheese
As a result, the food set will contain Mickey, Nancy, Swiss and Roquefort.
Intersecting animals with food, we can find the set of animals that are food:
animalsWhichAreFood := animals ^ food
These animals are Mickey and Nancy. We can use set subtraction to find those animals, which are not food:
animalsWhichAreNotFood := animals – food
These animals are Garfield and Fluffy.
Set cardinality. We can ask how many items are food:
howManyThingsAreFood := # food
This creates a new scalar (string) variable called howManyThingsAreFood, which is set to 4. This statement uses the cardinality operator # to return the size of the set
{“Mickey”, “Nancy”, “Swiss”, “Roquefort”}.
Command line statements. When Grok is used as a command line interpreter, you can directly enter statements, such as the example Grok statements we have given so far. You can type an expression or value, and the Grok interpreter will evaluate it and print it. For example, if you have already entered the above statements and you type
mouse
then Grok will print back to you
Mickey
Nancy
Sets are not ordered. Since ordering is immaterial in sets, we do not know if Grok will output Mickey before or after Nancy.
Script language. Grok is also a script language, so you can write a Grok program, consisting of Grok statements, to be executed by the Grok interpreter.
Set comparisons. We can compare sets, for example:
mouse <= food
This determines whether mouse is a subset of food. Both elements of mouse (Mickey and Nancy), are elements of food, so the Grok interpreter will output true.
The Grok set comparison operators are given in Figure 1.1.
== |
equal, which can also be written as a single equal sign |
~= |
not equal |
<= |
subset |
>= |
superset |
< |
strict subset |
> |
strict superset |
Figure 1.1. Grok Comparison Operators
Summary of set operators. This section has illustrated how sets are constructed using the notation {item1, item2, …}, how these sets can be combined using the union (+), intersection (^), and subtraction (-) operators, and how they can be compared, using <=, =, etc. As well, we introduced the cardinality operator (#).
We will now introduce Grok’s relations and relational operators. In Grok, a relation is a set of ordered pairs. For example, a relation might consist of the two pairs (Garfield, Fluffy) and (Mickey, Nancy). This relation might be the love relation, signifying that Garfield loves Fluffy and Mickey loves Nancy.
Cross product. We can use the cross product operator X (capital letter X) to create relations, for example:
chase := cat X mouse
This can be considered to mean that cats chase mice. As defined by this statement, the chase relation consists of these four pairs
Garfield Mickey
Garfield Nancy
Fluffy Mickey
Fluffy Nancy
In other words, both cats chase both mice. In general, the cross product of sets S and T produces every pair (x, y) such that x is in S and y is in T.
We can create the eat relation this way
eat := chase + mouse X cheese
This creates the eat relation to be the same as chase, but with the addition (union) of the cross product of mice and cheese. As a result, the eat relation will contain these eight pairs:
Garfield Mickey
Garfield Nancy
Fluffy Mickey
Fluffy Nancy
Mickey Swiss
Mickey Roquefort
Nancy Swiss
Nancy Roquefort
We can interpret this to mean that cats eat mice and mice eat cheese.
Operator precedence. Grok uses the traditional precedence of operators from mathematics, namely, additions (as well unions and subtractions) are done after multiplications (as well as intersections and cross products). For example, in the statement eat := chase + mouse X cheese, the cross product X is done before union (+).
Relation cardinality. We can use the cardinality operator # to ask how many pairs are in a relation:
howManyPairs := # chase
This will create the string variable howManyPairs and will assign it the value 4.
RSF. We sometimes represent the facts of a collected set of relations as triples of the form (R, x, y), where relation R contains the pair (x, y). For example, the chase and eat relations, taken together, can be represented this way:
chase Garfield Mickey
chase Garfield Nancy
chase Fluffy Mickey
chase Fluffy Nancy
eat Garfield Mickey
eat Garfield Nancy
eat Fluffy Mickey
eat Fluffy Nancy
eat Mickey Swiss
eat Mickey Roquefort
eat Nancy Swiss
eat Nancy Roquefort
This list notation is called RSF (Rigi Standard Format), because it was developed for use in the Rigi reverse engineering tool. We sometimes call a triple a fact, and we sometimes call these lists factbases. Ordering is immaterial in relations, so we can change the ordering of the lines in this list without changing the meaning of the list.
Set
projectors. We can use the set
projector operator, written as a dot,
to determine how a set is related to another set. For example, we can determine what Mickey
eats, by asking the question: How is the set {“Mickey”} related, via eat,
to another set:
mickeyEatsWhat := {“Mickey”} . eat
This statement assigns mickeyEatsWhat the set containing Swiss and Roquefort. Similarly, we can ask what eats Mickey:
whatEatsMickey := eat . {“Mickey”}
This statement assigns whatEatsMickey the set containing Garfield and Fluffy. In the general case, when s is a set and R is a relation, the form s . R produces the set t such that if s contains x and R contains (x, y) then t contains y. The form R . s produces the set t such that if R contains (x, y) and s contains y then t contains x.
Domains and ranges. The dom (domain) of a relation is the set of all beginnings (left ends) of pairs in a relation. For example
eater := dom eat
This statement assigns eater the set of things that eat. This set consists of Garfield, Fluffy, Mickey and Nancy. The statement
food := rng eat
assigns food the set of things that are eaten. This set consists of Mickey, Nancy, Swiss and Roquefort.
The sink of a relation R is the set of items that are in R’s range but not in its domain. We can compute the sink for the eat relation as follows:
bottomOfFoodChain := rng eat – dom eat
This assigns bottomOfFoodChain the set containing Swiss and Roquefort. The source of a relation R is the set of items that are in the relation’s domain, but not in its range, as in:
topOfFoodChain := dom eat – rng eat
This assigns topOfFoodChain the set containing Garfield and Fluffy.
Relation union, intersection and subtraction. There are a number of Grok operators that combine relations. For example, we can intersect the eat and chase relations to tell us about both eating and chasing:
bothEatAndChase := eat ^ chase
This statement assigns bothEatAndChase this set of pairs:
Garfield Mickey
Garfield Nancy
Fluffy Mickey
Fluffy Nancy
This means that Garfield both chases and eats Mickey, that Garfield both eats and chases Nancy, etc. Relation intersection (^), returns the sets of pairs that are common to two relations. Analogously, relation union (+), returns all pairs that exist in either of the two relations.
Subtraction (-) is another relation operator, for example,
eatButNotChase := eat – chase
This assigns eatButNotChase those pairs in eat that are not in chase. As a result, eatButNotChase records that fact that mice eat but do not chase cheese.
Similarly, we can determine the relation for chasing but not eating:
chaseButNotEat := chase – eat
It turns out that this is the empty relation, because in our example, all chased things (the mice) are eaten (by the cats).
Empty relation. We could as well have written
chaseButNotEat := EMPTYREL
EMPTYREL is Grok notation for the relation containing no pairs. Another way to accomplish the same thing is to write the cross product of two empty sets:
ChaseButNotEat := { } X { }
Relation comparisons. It turns out that the relation bothEatAndChase and chase are identical, which we could check using comparison of relations, by writing:
bothEatAndChase == chase
This will return the value true. We can compare relations using the comparison operators that apply to sets, namely, ==, ~=, >=, <=. > and <. Each relation is actually a set of pairs, and these operators have the same meanings as they do for proper sets.
Relation composition and transitive closure. We will now introduce the composition operator (o). Relation R1 composed with relation R2 produces a new relation by a join in which each result pair is formed by attaching the right end of a pair in R1 to the left end of a pair in R2. Consider this example,
secondOrderEat := eat o eat
As a result, secondOrderEat
will contain pairs such as (Garfield, Swiss) because Garfield
eats something which eats Swiss.
In general, if R1 contains (x, y) and R2 contains (y,
z), then the result of R1 o R2 will contain (x, z)
The transitive closure of a relation R composes R with itself repeatedly, for example:
anyOrderEat := eat +
We can think of the result as “eats at any level of indirection”. The relation anyOrderEat will contain the pair that are in eat, such as (Garfield, Mickey) and (Mickey, Swiss), as well as second order edges such as (Garfield, Swiss). In this example, there are no higher level indirections of eating, due to the simple nature of our eat relation.
We can also compute the reflexive transitive closure of a relation R, written as R*, which is the same as transitive closure, with the addition of self loops, (x, x), on all items in the factbase.
Set id (identity) operator. A set of self loops, on a particular set of nodes, can be computed this way:
isMouse := id (mouse)
This assigns isMouse this set of edges:
Mickey Mickey
Nancy Nancy
In general, if set s contains x, then id s contains (x, x).
The id (identity) operator, and its self loops, are handy for restricting a relation to start or to end with a given set of items. For example, we can select the pairs in the eat relation that are eaten by mice as follows:
eatByMouse:= id (mouse) o eat
The resulting eatByMouse relation contains only the pairs corresponding to eating by mice. This composition (o) is computed by starting with the self loops in id (mouse). This guarantees that each result pair starts with a mouse. These self loops are “joined” with pairs from eat. The resulting pairs corresponding to eating by mice.
The identity of all nodes in the current Grok factbase is written as ID.
Relation inverse. Another relation operation is the inverse (or reverse) operator, for example:
chasedBy := inv chase
For each pair (x, y) in chase, this produces the pair (y, x) in chasedBy. So if Garfield chases Mickey then Mickey is chasedBy Garfield.
This section summarizes the Grok operators that were introduced using the example involving Garfield et al.
Union, intersection, subtraction, comparison and cardinality. Using the example of the cats, mice and cheese, we have introduced the basic set and relation operators of Grok. These include the following five operations given in Figure 2.1.
Operator |
Form |
Union |
x + y |
Intersection |
x ^ y |
Subtraction |
x - y |
Comparison |
x = y, x <= y, … |
Cardinality |
# x |
Figure 2.1. Operators used with sets and with relations
The two operands (x and y in above examples) of union, intersection, subtraction and comparison must both be sets or else both be relations.
Set and relation constructors. Sets can be created using a set constructor, for example:
{ x, y, z }
The empty set can be written as
{ }
or as
EMPTYSET
Relations can be created using cross products, for example:
{ a, b } X { x, y, z }
The result of the cross
produce of sets s and t, s
X t, contains all pairs (x, y) such that x is in s and
y is in t.
A collection of relations can be represented in RSF format, i.e., as a list of triples. RSF format is used for files, but cannot be used in a Grok command.
The empty relation can be written as
EMPTYREL
or as
{} X {}
Domain and range. The domain (set of initial nodes) and range (set of final nodes) of relation R are written as:
dom R
rng R
Set
projectors. A
set s can be combined with relation R in a set projector using
the form
s . R
or
R . s
The
form s . R produces the set t such that if s contains x
and R contains (x, y) then t contains y. The form R . s produces the set t
such that if R contains (x, y) and s contains y
then t contains x.
Identity
relations. The identity
relation on set s is written as:
id s
If
s contains node x then id s contains the pair (x, x). The identity relation over all nodes in
relations in Grok’s current factbase is written as:
ID
Relation
inverse. The
inverse (or reverse) of relation R is written as
inv R
If
R contains pair (x, y) then its inverse contains (y, x).
Relation
composition. The
composition of relations R1 and R2 is written as:
R1 o R2
If
R1 contains the pair (x, y) and R2 contains the pair (y,
z), then their composition is a relation that contains the pair (x, z). (Composition is much like a “join” in a
conventional relational database.)
Transitive
closure. The (non-reflexive) transitive closure of relation R
is written as:
R +
The
transitive closure consists the union of R composed with itself any number of times:
R + = R +
(R o R) + (R o R o R) + …
The
symmetric transitive closure is written as
R *
The
value of R* is the same as R+ with the addition of self loops
(ID):
R * = ID + R+
This section shows how the Grok interpreter can be used as an interactive query engine. Before doing this, we will introduce an example C program and will show how facts about that program can be represented as a factbase. We will then give example Grok queries that probe these facts.
We will use a very simple C program,
called EG, as an example. Our C program consists of three files: main.c,
putint.c and putint.h. The first, main.c, sets a global
variable to 5 and calls the putint function with a parameter of 10. The putint.h
file declares a global variable to be an integer (as external) and gives the
header of the putint function (as external). The putint.c program
gives the body of the putint function, which computes the product of its
parameter and the global variable and prints the result. These files are listed
in Figure 5.1.
File main.c:
#include "putint.h"
void main(void){
global_var=5;
putint(10);
}
File putint.h:
#ifndef _PUTINT_H_
#define _PUTINT_H_
extern int global_var;
extern int putint(int param);
#endif
File putint.c:
#include "putint.h"
int global_var;
int putint(int param) {
int local_var;
local_var=global_var*param;
printf("The number is:%d\n",local_var);
}
Figure 5.1. EG: A Very Simple C Program
Facts Extracted from EG
The PBS project, at the Universities of Toronto and Waterloo, developed a special purpose version of the GCC compiler front end called CFX (C Fact Extractor), which translates C programs into graphs. In the case of the EG program the translation produces the graph as the stream of triples shown in Figure 5.2. (To be technically correct, we should say that CFX produces tables, which are translated into the stream by a program called fbgen. To keep things simple, we will consider that CFX directly produces the stream.) CFX stores this output stream as RSF triples in a file called factbase.rsf.
funcdef main.c main
defloc main main.c:3
include main.c putint.h
linkcall main putint
linkref main global_var
macrodef *Initialization* __GCC_NEW_VARARGS__
macrodef *Initialization* __GNUC_MINOR__
macrodef *Initialization* __GNUC__
macrodef *Initialization* __sparc
macrodef *Initialization* __sparc__
macrodef *Initialization* __sun
macrodef *Initialization* __sun__
macrodef *Initialization* __unix
macrodef *Initialization* __unix__
macrodef *Initialization* sparc
macrodef *Initialization* sun
macrodef *Initialization* unix
funcdcl putint.h putint
dclloc putint putint.h:5
vardcl putint.h global_var
vardclloc global_var putint.h:4
macrodef putint.h PUTINT_H
funcdef putint.c putint
defloc putint putint.c:4
vardef putint.c global_var
vardefloc global_var putint.c:3
include putint.c putint.h
include putint.c putint.h
librarycall
putint printf
Figure 5.2. The EG Program, as Translated by CFX to RSF Triples
As a reader, you do not need to concern yourself with the details of this stream. Our purpose in giving this stream is to illustrate the kinds of streams that Grok deals with. In this stream, the triple
funcdcl
F p
means file F contains a function declaration (function header) for p. The triple
funcdef F p
means file F contains a function definition (function body) for p.
CFX Schema
Figure 5.3 gives the list of the relations produced by CFX, in the form of a TA schema, with brief explanations of each relation. (Aside: in the last part of the schema, the string file:lineNum has been placed in quotes. This is done because the Grok interpreter puts quotes around strings containing colons.)
SCHEME TUPLE :
funcdef file function //
function is defined in file
funcdcl file function //
function declared in file
funcdcl file libraryfunction // libraryfunction declared in file
vardef file variable // var
defined in file
vardcl file variable // var
declared in file
macrodef file
macro // macro defined in file
typedef file type //
type defined in file
structdef file
struct // struct defined
in file
uniondef file
union // union defined
in file
enumdef file enum // enum defined in file
include file file //
file 1 includes file 2
sourcecall function
function // function 1
calls function 2
// (
function 2 definition is
// in same compilation unit
// OR
// is included in same compilation
// unit as function 1 )
sourceref function
variable // function
references variable
// ( variable definition is
// in same compilation unit
// OR
// is included in same compilation
// unit as variable )
linkcall function
function // function 1
calls function 2
// ( function 2's definition
// must be linked to )
linkref function variable //
function references variable
// ( variable definition must
//
be linked to )
usemacro file
macro // file uses
macro
usetype file type //
file uses type
usestruct file
struct // file uses
struct
useunion file
union // file uses union
useenum file enum //
file uses enum
librarycall function
libraryfunction // function calls libraryfunction
// ( a libraryfunction is a func
// whose definition does not
// exist in the analysed code )
//
Attribute tuples extracted by CFX
defloc function “file:lineNum” //
func definition location
dclloc function “file:lineNum” //
func declaration location
deflocend function
“file:lineNum” // func
definition end location
vardefloc variable
“file:lineNum” // var
definition location
vardclloc variable
“file:lineNum” // var
declaration location
macrodefloc macro
“file:lineNum” // macro
definition location
typedefloc type
“file:lineNum” // type
definition location
enumdefloc enum
“file:lineNum” // enum
definition location
structdefloc
struct “file:lineNum” // struct definition location
uniondefloc union
“file:lineNum” // union
definition location
enumdeflocend enum
“file:lineNum” // enum
definition end location
structdeflocend
struct “file:lineNum” // struct definition end location
uniondeflocend union
“file:lineNum” // union
definition end location
Figure 5.3. The CFX Schema (Also Called the PBS Low Level Schema)
The notation for the schema given in Figure 5.3 has, almost, the form of a TA scheme (schema). However, the form of this schema is a bit dated (CFX and this schema were created before TA was defined), in that the attributes are written as relations. For example, this line in the schema
defloc function “file:lineNum”
would be expected to be written in TA as
function { defloc = “file:lineNum” }
However, we can treat attributes as relations with no harm done, so we will not be concerned with this difference.
CFX Schema as an Entity/Relation Diagram
We can draw a schema as a diagram. For example, the CFX schema is shown as a diagram in Figure 5.4. This diagram has the form (almost) of an Entity/Relation Diagram; E/R Diagrams are commonly used in the design of relational databases.
Figure 5.4. The CFX Schema as a Diagram
Querying EG’s CFX Factbase
We can use the stream shown in Figure 5.2 as the basis for answering questions about the EG program. In particular, we can use Grok to ask queries about the EG program. To do this, we can start up the Grok interpreter (the command for Grok is grok or grok.x). Our first command to Grok is:
getdb “factbase.rsf”
This reads the stream into Grok’s internal memory. (Aside: In interactive mode, Grok is forgiving of syntactic details, so when we are typing this interactively, we can omit the quotation marks and type simply getdb factbase.rsf.)
We can ask Grok to list the entire database by typing
listdb
When we do this, Grok produces a list of triples much like that in Figure 5.2. However, its ordering of triples may be different, because the order of relations is immaterial to Grok.
We can determine the number of triples by typing
dbsize
Grok will output the number 29, which is the count of the triples in the list.
For unknown reasons, the CFX extractor sometimes produces repeated copies of triples. In Figure 5.2, the factbase.rsf stream contains two copies of the line
include putint.c putint.h
which
appear near the bottom of the list. We
can delete duplicate lines by typing
deldups
After we have typed deldups, when we again type dbsize, Grok will tell us that there are only 28 triples, because the one of the duplicated copies was deleted.
Perhaps we want to check to see what includes were in the EG program. To do this, we can type
include
This asks Grok to list all the pairs in the include relation, so it outputs:
main.c putint.h
putint.c putint.h
These two pairs tell us that both main.c and putint.c include putint.h.
We can ask the question: What files does main.c include, by typing:
main.c . include
This form is a forward set projector. As a result, Grok will type
putint.h
In this example, we have cut down on the amount of we needed to type by taking advantage of Grok’s forgiving nature. Grok’s set projectors actually expect to receive a set value. To be perfectly correct, and to avoid Grok error messages, we should really have written:
{ “main.c” } . include
But when we are working interactively, we often use these shortcuts.
We can determine what files include putint.h by typing:
include . putint.h
This form is a backward set projector, and again we are minimizing the amount we need to type. As a result, Grok will type
main.c
putint.c
We might want to know more about calls in the EG program. We could start out by asking the names of all relations, by typing
relnames
As a result, Grok will output the list of names of relations:
funcdef
defloc
include
linkcall
linkref
macrodef
funcdcl
dclloc
vardcl
vardclloc
vardef
vardefloc
librarycall
From this list, we can surmise that the fact base contains two kinds of calls: linkcall and librarycall. From the schema (Figure 5.3) we know that a linkcall is a call that is linked from one user compilation to another, while a librarycall is a call to a library function. To inspect all linkcalls, we type
linkcall
As
a result Grok outputs
main putint
To
find out about library calls, we type
librarycall
and Grok outputs
putint printf
We have determined that the only linked function call is from main to putint, and the only call to a library is from putint to printf.
We may want to know what things putint is related to. To find out, we type
putint
:
It is helpful to think of the colon operator (:) as two eyeballs, looking forward from putint to see what it is related to. As a result, Grok will output:
dclloc putint.h:5
defloc putint.c:4
librarycall printf
From looking at the comments in the CFX schema, we can surmise from this that putint is declared on line 5 of putint.h and defined on line 4 of putint.c. We can also re-affirm that putint calls the library function printf.
We might want to know what depends on the putint function. To find out we type
:
putint
It is helpful to think of this as two eyeballs looking at putint. As a result, Grok will output:
main linkcall
putint.h funcdcl
putint.c funcdef
This means that function main calls putint. It also means that putint is declared in putint.h and is defined in putint.c.
As our next example, we will consider a larger program. This program is a compiler, written in C, for a Pascal-like language. It has a traditional compiler structure, with a scanner, a parser, a semantics pass and a code generator. We have run C488 through CFX and have obtained the factbase for it. This factbase, factbase.rsf, for C488 is much larger than the factbase for EG. Further, we have run this factbase though the PBS pipeline and as a result, we have created a landscape (structural diagram) for it. You may want to view it on the web by this URL:
http://swag.uwaterloo.ca/pbs/examples/C488/V1/index.html
If you navigate around through this landscape, you will discover that the C488 compiler consists of a number of files, such as parser.c, parser.h, codgen.c, codegen.h, globalvars.h, tokendef.h, etc. You can observe that the only relations shown in the diagram are call, reference, implementby and otherdep. As you may recall, the CFX schema has many more relations than this. Why does the diagram have fewer and different relations? The answer is that a number of Grok scripts manipulated the factbase, combining and renaming relations, to make the diagrams easier to comprehend. Another thing to notice is that the diagram contains higher level entities, such as the semantics subsystem, which group together a number of files. In particular, the semantics subsystem groups these files: semantics.h, semantics.c and semanticext.h.
When landscape diagrams are inspected, it is common to observe surprising dependencies. For example, in the C488 top view, there is a function call from the codegen subsystem to the semantics subsystem. Why should the code generator call the semantic analysis part? It seems that semantic analysis should be completed before code generation starts, so such a call would be wrong or maybe just peculiar. We want to know what actual function call(s) are causing this high level interaction from the code generator to semantics.
Since this kind of question happens repeatedly, we will give it a name: the Where-Did-This-Arrow-Come-From problem. Ideally, we would just press a button on the landscape diagram display, and we would get the answer. The reason there is no button is that there has been considerable abstraction and simplification from the source program to the CFX factbase to the stream used as a basis for drawing the landscape diagrams. However, by using the factbase and Grok, we can work our way backward to answer this question.
The Where-Did-This-Arrow-Come-From
Problem
We will consider the following question.
Q1: Where did this arrow from subsystem S to subsystem T come from?
As an example, we will consider the call arrow from S, the code generator subsystem, to T, the semantics subsystem. We will answer this question in two steps.
Step 1. Using the landscape diagram, we will partly answer the question by finding out what files in system S call what files in subsystem T. Given that we have discovered that file F in S calls file G in T, we are now ready for the easier question:
Q2:
Where did the arrow from file F to file G come from?
Step 2. We will answer question Q2 by using Grok to query the program’s CFX factbase.
Aside: This example uses a factbase created by the CFX fact extractor. If we were using a different fact extractor, such as CPPX, we would need to modify our approach, but the same general method would apply.
In Step 1, we can use the diagrammatic querying facilities of the landscape drawing program to file out what file(s) in the code generator call what file(s) in semantic analysis. By concentrating first on the codegen subsystem, we can look into its internal diagram to determine that the only file in it that makes calls is codegen.c. We can conclude that F (the file in S) is codegen.c. Then we can look inside the semantics subsystem, where we can immediately discover that the only file that is called is semantics.c. So we can conclude that G (the file in T) is semantics.c. In a more complex example, we would find more than one file F that calls more than one file G. Also, in a more complex example, we may have to spend some time diving deeper into the landscape diagrams to locate the files that are causing the interaction. At any rate, once files F and G are isolated, using the diagrams, we are ready to go to step 2.
To carry out Step 2, we use the program’s factbase.rsf file, which was created by CFX. This factbase contains interactions at the level of functions and global variables. After starting up Grok, we type this command to load the factbase:
getdb
factbase.rsf
Once this is loaded, it is a good idea to check its size (using dbsize) and to take a look at the names of its relations (using relnames), just to gain confidence that what we are inspecting makes sense. It is also important to understand the schema reasonably well; it is recommended that you refer to the CFX schema (Figures 5.3 and 5.4). From the schema, we can tell that the funcdef relation records when file F defines function P. We can find the names of all functions declared in file codegen.c by typing:
P := codegen.c . funcdef
To check that F seems reasonable, we can find out how many functions are in it, by typing:
# P
When we do this, Grok tells us that P contains 21 items. Since P is not large, it is a good idea to list these items, by typing:
P
As a result, Grok outputs all the functions defined in codegen.c:
BBS_Pop
BBS_Push
Code
Code31
Code32
Code46
Code47
FBS_Pop
FBS_Push
FBS_Swap
LES_Pop
LES_Push
Pop_FuncName
Pop_arrayName
Push_FuncName
Push_arrayName
Return_Func_NumArgs
allocate_storage
codegenFinalize
codegenInitialize
generateCode
This list seems reasonable, i.e., it is not surprising that these functions are part of codegen.c.
We will next find all functions in semantics.c. In querying to find these functions, we will ask for all functions that are either declared or defined in semantics.c, by typing:
Q :=
semantics.c . (funcdef + funcdcl)
We
could have written simply
Q :=
semantics.c . funcdef
However, there is the possibility that semantics.c might contain a function header, without a corresponding body, and that the arrow we are trying to locate might be pointing to this header. So we are using the first (the longer) query. Once we have the value of Q, it is a good idea to see how big it is and to list the names of functions contained in it. We have illustrated these operations for P, so we won’t show them for Q.
Now we are ready to give a query that asks for all linkcalls from functions in codegen.c (in P) to functions in semantics.c (in Q). Here is the query:
id (P)
o linkcall o id
(Q)
This query makes use of the id function twice. The first id produces self loops on each member of P. These self loops are composed with linkcall so that the result includes only those pairs in linkcall which start with a function in P. These pairs are next composed with id (Q) so that the resulting relation contains only those pairs that end with a function in Q. The result contains all those link calls from a function in codegen.c to a function in semantics.c. As a result of this query, Grok outputs
generateCode ErrorHandler
From this we conclude
that there is only one function, generateCode, in codegen.c, which calls
a function in semantics.c. The
only such called function is ErrorHandler.
To find out more about the generateCode function, we can type
generateCode :
As a result, Grok outputs:
defloc codegen.c:365
deflocend codegen.c:869
linkref errorFile
linkref scanInteger
linkref traceCodeGen
linkref traceFile
linkcall ErrorHandler
linkref PrevScope
linkref ScopeLevel
librarycall fprintf
librarycall strlen
dclloc codegen.h:21
From this list, we can see that generateCode is declared (defloc) on line 21 of codegen.h and is defined (defloc) on line 365 of codegen.c We could make a similar query for ErrorHandler, but instead we will specialize the query to give us the locations of ErrorHandler’s declaration and definition, by typing:
ErrorHandler . (dclloc
+ defloc)
As a result of this query, Grok outputs:
"semantics.c:138"
This output is surprising, in that it indicates that the Error Handler does not have both a definition and a declaration. To find out more, we can limit our query to try to find only the declaration, by typing:
ErrorHandler . defloc
Intriguingly, Grok produces no output for this query, indicating that ErrorHandler has no declaration. We can conclude that ErrorHandler has a definition (we could write a query to explicitly check this), but no declaration.
What we have learned is that ErrorHandler is programmed with a peculiar style, with a definition but no declaration. While this usage is allowed by the C language, it a frowned upon as being bad style by many programmers. This is particularly suspicious, because we have also discovered that this function is part of semantics.c, but is called from codegen.c. To see how widely ErrorHandler is used, we type:
linkcall
. ErrorHandler
As a result Grok outputs only
generateCode
This tells us that ErrorHandler is called (outside of semantics.c, externally via linking) by only the generateCode function. This is further evidence that ErrorHandler is used in a peculiar fashion, and perhaps should be moved to a new location, or perhaps that generateCode should be calling a function other than ErrorHandler.
We will not explore this peculiarity further here. Our goal was to show how to solve the Where-Did-This-Arrow-Come-From Problem. To keep the explanation simple, we have used a straightforward example. In more complex cases, it is easy to get lost in the details. Keeping track of the queries, and avoiding mistakes when typing the queries, is tedious and can be error prone. At the same time, it can also be rewarding, leading to discoveries of unexpected and sometimes incorrect parts of the program. Indeed, it is intellectually interesting to be able to navigate through the structure of a piece of software.
Credits. Gerry Kovan, with assistance from Ric Holt, created the CFX TA schema in Figure 5.3. Ivan Bowman created the E/R Diagram for this schema, in Figure 5.4.
Part 4.
Writing Grok Scripts
The preceding section showed how the Grok interpreter can execute user command interactively. We showed how this facility can be used to determine what functions in file codegen.c call what functions in file semantics.c.
While it is handy to execute Grok commands interpretively, it is also handy to bundle up a set of commands and run them as a script. Figure 6.1 gives a script that solves this same Where-Did-This-Arrow-Come-From Problem.
1 %
Find call edges from P to Q
2
getdb "factbase.rsf"
3 P
:= {"codegen.c"} . funcdef
4
countP := # P
5 put
"codegen.c contains ", countP, " function defs"
6 Q
:= {"semantics.c"} . (funcdef + funcdcl)
7
countQ := # Q
8 put
"semantics.c contains ", countQ, " function defs and dcls"
9 put
"Here are the linkcalls from codegen.c to semantics.c"
10 id (P)
o linkcall o id
(Q)
Figure 6.1. Grok Script comefrom.grk
The Grok script in Figure 6.1 contains roughly the same commands that were given in the previous section as interactive command. (The line numbers in the figure are only for reference purposes, and so not actually appear in the script.) In the figure, we have inserted a comment in line 1; as you can see, comments start with a percent % character.
In line 2, we added quotation remarks around factbase.rsf. In lines 3 and 5, we added quotation marks as well as set brackets { … } around the file names. These additions make the commands syntactically correct. In the interactive commands, we avoided some typing by taking syntactic shortcuts, which are, officially speaking, illegal, but since the interpreter is forgiving, we saved ourselves a few keystrokes.
Aside: While it is convenient to take shortcuts when typing interactive commands, you should avoid these and eliminate syntax errors in a Grok script. Grok scripts are hard enough to write without getting confused by error messages. Besides, error messages often help isolate serious bugs in the script.
Lines 5, 8 and 9 illustrate the use of the put statement. It accepts sequences of items separated by commas. These items can be strings, as well as names of variables.
We can run this script from the command line by typing
grok comefrom.grk
When we do this, we will get this output:
codegen.c contains 21 file defs
semantics.c
contains 17 file defs and dcls
Here
are the linkcalls from codegen.c to semantics.c
generateCode ErrorHandler
There is another way to run this script, namely, we can give an interactive command to invoke it. To do this, we can first start up the Grok interpreter by typing
grok
Once Grok starts up, we can give this command:
exec comefrom.grk
This runs the script and produces the same output as has just been shown. When the script finishes running, Grok will still in interactive mode, so we can continue giving it commands. For example, we could type this simple command:
P
This command lists the items contained in set P (these are the functions defined in codegen.c). More generally, using the exec command in interactive mode is helpful because it allows us to inspect the values of relations and variables that have been computed by the script. The exec command can also be used in a script, to all another script as a subroutine. For example, the command exec comefrom.grk could appear in a script.
Our script solves one particular Where-Did-This-Arrow-Come-From Problem. However, it works only for two particular files, codegen.c and semantics.c and only for a factbase called factbase.rsf. We can generalize it to handle any files and any factbase by using arguments, as is illustrated in Figure 6.2.
1 %
Find call edges from P to Q
2 if
$n ~= 3 then % Are there 3 arguments?
3
put "Usage: grok
comefrom.grk factbaseRsf fromFileName toFileName"
4
put "Finds linkcalls from fromFileName to toFileNamec "
5
put "Author: Ric Holt,
Date: 22 Apr 2002"
6
quit
7 end
if
8
9
factbaseRsf := $1
10 fromFileName := $2
11 toFileName := $3
12
13 getdb factbaseRsf
14 P := { fromFileName } . funcdef
15 countP := # P
16 put fromFileName, " contains ",
countP, " function defs"
17 Q := { toFileName } . (funcdef + funcdcl)
18 countQ := # Q
19 put fromFileName, " contains ",
countQ, " function defs and dcls"
20 put "Here are the linkcalls from
", fromFileName, " to ", toFileName
21 id (P)
o linkcall o id
(Q)
Figure 6.2. Grok Script comefrom.grk Rewritten with Arguments
In
the Grok script, the notation $n (line 2) gives the number of arguments, and
$1, $2, $3, etc. (lines 9 –11) give the values of arguments 1, 2, 3, etc. Grok’s form of if statement is
illustrated by lines 2 to 7. The quit
statement (line 6) aborts the script.
The
script can be run by giving the command
grok comefrom.grk factbase.rsf codegen.c
semantics.c
or in interactive mode by typing
exec comefrom.grk factbase.rsf codegen.c
semantics.c
Other
file names and another name of a factbase can be given as arguments.
If
a script is used often, it is good to create a shell script that invokes it, in
order to simplify use of the script.
For example, we might create a shell script named comefrom that
consists of this line:
grok comefrom.grk $@
The
form $@ matches all the arguments passed to
Grok. This shell script allows us to use comefrom by just typing comefrom
factbase.rsf codegen.c semantics.c.
For this to be generally useful, the shell script should provide an
absolute path (or a path determined by a shell variable) so that comefrom.grk
can be found from any directory.
Invoking a Script from a Library
This
exec command can be used in another Grok script. When this is done, care must be taken to be
sure that the script, comefrom.grk in our example, can be located by
Grok. By default, Grok will expect to
find the script in the local directory.
The SwagKit project, at the University of Waterloo, has a library of
commonly used Grok scripts. These are
stored in a directory, which is located by a shell variable called
SWAGKIT_PATH. Figure 6.3 gives an
example of using this shell variable.
execPath := getenv "SWAGKIT_PATH"
linkProcedure := execPath cat "/lib/pipeline/" cat
"lift.grp"
exec linkProcedure
Figure 6.3. Using a Shell Variable to Locate a Grok Script
In the figure, the Grok built-in function getenv fetches the value of the SWAGKIT_PATH shell variable. This is assigned to string variable execPath. The complete path to the Grok script is built up by catenating execPath, /lib/pipeline/ and lift.grp. If SWAGKIT_PATH has the value /home/holt/dev_swagkit (as it currently does for the author of this note), the complete path for the Grok script will be
/home/holt/dev_swagkit/lib/pipeline/lift.grp
(Note: The suffix .grp is used by convention to label a Grok procedure.) If you build up a library of Grok scripts, you can use a shell variable to locate your scripts, as is done for SwagKit.
A common use of Grok scripts is to read a factbase and to simplify or abstract it. One way of doing this is to combine sets of similar relations into a single relation. Figure 6.4 gives a script, named cfxadddep.grk, which combines all dependencies in a CFX factbase into a single relation called dep. The factbase, with dep added to it, is written out to a new factbase.
1 %
Add dependency 'dep' to CFX fact base.
2 if
$n ~= 2 then % Are there 2 arguments?
3
put "Usage: grok
cfxadddep.grk factbaseRsf newFactBaseRsf"
4
put "Combines all dependencies into a single dependency 'dep'"
5
put "Author: Ric Holt,
Date: 22 Apr 2002"
6
quit
7 end
if
8
9
factbaseRsf := $1
10 newFactbaseRsf := $2
11
12 cfxDep := { "include",
"sourcecall", "sourceref", "linkcall", \
13 "linkref",
"usemacro", "usetype", "usestruct",
"useunion", \
14 "useenum",
"librarycall" }
15
16 for d in cfxDep
17 $ d := EMPTYREL
18 end for
19
20 getdb factbaseRsf
21
22 dep := EMPTYREL
23 for d in cfxDep
24 dep := dep + $ d
25 end for
26
27 putdb newFactbaseRsf
Figure 6.4. Combining Dependency Relations
Lines 12-14 of the script create the set cfxDep, which contains the names of all dependencies in the CFX schema (see Figure 5.3 and 5.4). Note that a backslash \ has been used at the ends of lines 12 and 13 to extend the command to multiple lines.
Lines 16-18 and 23-25 are examples of Grok for statements. These iterate across elements of a set of strings. In each of these, the loop is executes with d assigned each of the values contained in set cfxDep. We know that the iterations will assign d each value from the set, but we are not guaranteed what the ordering will be.
Line 17 contains the statement
$ d
:= EMPTYREL
The dollar sign $ means that the value of
d is to be treated as a relation or variable. For example, when d has the value sourcecall, this
statement will effectively be
sourcecall
:= EMPTYREL
The reason that we have assigned the value EMPTYREL is a bit subtle. This assignment acts like a “declaration” of the relation. This “declaration” needs to precede the statement getdb factbaseRsf, so that, if the factbase does not contain any pairs in the relation, then the relation still exists (is declared), even though it is empty. Without this “declaration”, if there were no pairs in the relation, the relation would not exist, as far as the Grok interpreter is concerned. This is in turn would cause problems in line 24, in which the dep relation is constructed from all the dependency relations. If d has a value (such as sourcecall) in line 24 which isn’t the name of a relation, the Grok interpreter will not know how to execute the line. To avoid this difficulty, any relation being read from an RSF or TA stream, which might not have any pairs in the stream, should be assigned EMPTYREL before reading the stream.
Combining dependencies into “higher level” dependencies is a common way to abstract information in a graph. Another way is to eliminate low-level entities, such as statements, or even functions within files. When this is done, the dependencies from the low level entities are usually lifted (also called raised) to the level of their parent nodes. Grok scripts to do this lifting are part of the PBS and SwagKit pipelines that prepare graphs for visualization.
We have introduced key features of Grok by means of examples. We will finish by mentioning a couple more features.
It is common to need to delete some relations or some sets of nodes from a factbase. For example, in the script in Figure 6.4, after adding the dep relation we might want to delete all the relations, which were combined into dep. This can be done as follows
for d in cfxDep
delrel $ d
end for
This loop uses the delrel command to delete relations from the factbase. To remove a set of nodes from the factbase, whose names are in a set nodesToDelete, we can use this statement
delset
nodesToDelete
This command removes all triples in the factbase that reference nodes in nodesToDelete.
When debugging a Grok script, the following command can be very helpful:
option
echo
This tells the Grok interpreter to trace the script, outputting each command as it is executed. The command
option timing
outputs the time taken to execute each Grok command.
We will now give a final observation about debugging Grok scripts. Grok does not have a compiler, so there is no direct way to see if your script is syntactically legal. However, the following trick works reasonably well: run the script with dummy values for the arguments. This makes Grok run the script (with some bogus messages about bad arguments), but more importantly, gives messages about other syntax errors. You should fix such errors before testing your script on real data.
There are a number of other commands, which are summarized in the following section. This concludes our introduction to Grok.
Part 5.
Summary of Grok Features
1.
Grok input/output commands
Grok
input commands:
getdb F Read RSF factbase from file F
adddb F Add new tuples to data base from file F
getta F Read TA factbase from file F
addta F Add more TA to date base from file F
getset s [F [D]] Read set s [from file
F [in dir D]]
Grok
output commands:
putdb F Write RSF data base to file F
relToFile R F Write rel R to file F
appendRelToFile R F Append rel R to file F
putta F Write TA factbase to file F
putset s [F [D]] Write set s [to file F [in dir D]]
put msg {, msg} Print msgs; each is a string or a variable name
The
file format for relations for get/put is ASCII triples, eg,
R a b
where R relates a to b (this format is called 'RSF').
The
file format for sets is blank-separated strings.
Beware
that the read/write (compressed) format isn't portable.
When
TA is read in, attributes are treated as relations (with @_
added
as a prefix). The relations and
attributes at the scheme level
have
$_ prepended to their names.
2.
Grok relation operators:
R1 + R2 Union of relations
R1 - R2 Difference of relations
R1 ^ R2 Intersection of relations
R1 o R2 Relational composition [was previously '*']
inv R Inverse of R
s . R Project set s through rel R [was *]
R . s Project set s through rel R backwards [was *]
R+ Transitive closure of R
R* Reflective transitive closure of R
R1 cmp R2 Compare: cmp is == [or =], ~=, <, <=, >, >=
EMPTYREL The empty relation
# R Cardinality (number of tuples) of R
dom R Domain of R (a set)
rng R Range of R (a set)
ent R Range + domain of R (all entities in R)
id s Identity relation on set s
DOM Domain of all relations in factbase (a set)
RNG
Range of all relations
in factbase (a set)
ENT Range + domain of all relations (a set)
ID The identity relation (based on ENT)
head R Head of R, eg (x y) in R produces ((x R y) head y)
tail R Tail of R, eg (x y) in R produces (x tail (x R y))
rel2tripleset R Make R into a set, eg (x y) produces (x R y)
tripl2relname Given '(x R y)' produce 'R'
3.
Other relation commands:
relnames Set of the names of relations
delrel R Remove R's tuples factbase
listdb Print triples in factbase
sample R [b [n] ] Print sample from R.
Begin with
number b (default 1)
& print n (10)
slice s Remove tuples
that reference other than s
wslice s Remove tuples that do not reference s
delset s Remove tuples from factbase that reference s
delsetset s t Remove tuples with source from s and target from t
deldups Delete any duplicate tuples in factbase
dbsize Number of tuples in factbase
cleardb Delete all relations
4.
Grok set operators:
s1 + s2 Union of sets
s1 - s2 Difference of sets
s1 ^ s2 Intersection of sets
s1 cmp s2 Compare: cmp is == [or =], ~=, <, <=, >, >=
# s Cardinality (number of elements) of s
s1 X s2 Cross product s1 X s2 (a relation)
{ e1, e2, ... } Construct set containing e1, e2, ...
EMPTYSET The empty set.
You can use { } instead
pick s Pick a value (a string) from s
prefix Q s Set of items in s with prefix Q
suffix s Q Set of items in s with suffix Q
5.
Other set commands:
setnames Print the names of set variables
sample s [b [n] ] Print sample from s.
Begin with
number b (default 1)
& print n (10)
clearsets Delete all sets
addprefix Q s Add prefix Q to items in set s
addsuffix s Q Add suffix Q to items in set s
addprefixrel Q s Add prefix Q to rel names in set s
addsuffixrel s Q Add suffix Q to rel names in set s
addprefixdb Q s Add prefix Q to src & trg of rel names in set s
addprefixsrc Q s Add prefix Q to src of rel names in set s
addsuffixsrc s Q Add suffix Q to src of rel names in set s
addprefixtrg Q s Add prefix Q to trg of rel names in set s
addsuffixtrg s Q Add suffix Q to trg of rel names in set s
delprefix Q s Delete prefix Q to items in set s
delsuffix s Q Delete suffix Q to items in set s
delprefixrel Q s Delete prefix Q to rel names in set s
delsuffixrel s Q Delete suffix Q to rel names in set s
delprefixsrc Q s Delete prefix Q to src of rel names in set s
delsuffixsrc s Q Delete suffix Q to src of rel names in set s
delprefixtrg Q s Delete prefix Q to trg of rel names in set s
delsuffixtrg s Q Delete suffix Q to trg of rel names in set s
duplicate s Creates new set with elements e.dup for
each element e of
s, with same
connecting
relations as in s
6.
Grok number operators:
n1 + n2 Addition
n1 - n2 Subtraction
n1 * n2 Multiplication
n1 / n2 Real division
n1 div n2 Truncating division
n1 mod n2 Modulo
n1 cmp n2 Compare: cmp is
== [or =], ~=, <, <=, >, >=
7.
Grok string operators
S1 cat S2 Catenate strings S1 and S2
S1 cmp S2 Compare: cmp is == [or =], ~=, <, <=, >, >=
s : Out edges from node s
: s In edges
to node s
8.
Grok boolean operators:
b1 and b2 Conjunction
b1 or b2 Disjunction
not b Boolean negation
9.
Levels of precedence are, from low to high (most binding):
and
or
cmp
(==, ~=, <,
<=, >, >=)
+ - cat (additions)
* X
prefixOperators (inv dom rng ent - not ~ # pick $)
suffixOperators (*, +, ie, transitive closure)
Parentheses
(...) can be used to specify order of operations
10.
Statements:
x := expn Assign expn to variable x (an identifier)
$ x := expn Assign expn to variable whose name is in x
(The 'x = expn' form is considered to
be obsolete.)
if expn then Cascaded 'if' statement, with optional
statements 'elsif' and 'else' clauses
{elsif expn then
statements}
[else
statements]
end if
loop Loop statement, stopped by 'exit'
statements
end loop
for e in s
Repeat with each element
e from set s
statements (exits are supported)
end for
exit Exit from loop/for
exit when expn Exit from loop/for when expn is true
return Return from current Grok script
quit Quit (halt Grok)
11.
Other Grok commands:
varnames List of all relation, set and
string variables
$ str String 'str' is taken as a variable's name
If value is integer i,
evaluates to arg i
$i Value of parameter i, i is 0, 1, 2 ...
$n Number of parameters
getenv s Get environment value s
sysexec s Execute shell command s. Value is return code
expn The value of expn (expression) is printed
exec F p1 ... Exec file F (Grok script) with parameters p1 ...
where t is each
member of s
randdb t r n Generate t tuples with r relations and
n nodes.
? Print the help menu
reset Restart (clear relations and variables)
Use
'\' in last column to extend commands across line boundaries
12.
Setting Grok options:
option Q Set option Q
The
options that can be set include:
timing Print time to complete each operation
echo Echo each command (useful for tracing scripts)
debug Interpret 'debug comand' as 'command' (default off)
verbal Print out extra info during commands
progress Show progress on large operations (verbose)
optimize Run fast (default)
dumpsets Dump all sets and set variables (sets.tu)
dumpnames Dump all stored names (storage.tu)
The
first of these are turned off using 'no', as in 'option notiming'.
13.
Limits on Sizes in Grok
Many/most
data sizes are dynamically increased up to the maximum available memory. The dynamically adjusted sizes are for:
Triples
Names
(Along with internal tables using
for sorting, etc.)
Sizes
which are statically fixed are:
Maximum length of names; strSize=255
bytes, can increase to 255 at most
Maximum tokens on a source Grok line;
maxTokens (can be increased)