Introduction to the Grok Language

Ric Holt, 5 May 2002

Abstract. This note introduces Grok, a notation for manipulating binary relations. Grok has an interpreter, which can be considered to be a relational calculator. This interpreter has been used extensively for analyzing factbases produced by parsers that extract information from source programs.

Background and Overview. The initial version of Grok was created by the author in 1995. It has evolved to become a language for manipulating factbases. Grok operates at the level of a relational database, in that operators generally apply across entire relations, and not just to single entities. The Grok interpreter has been optimized to handle large factbases (up to several hundred thousands of facts). It keeps all of its data structures in memory. It is written in the Turing language.

The term grok means to understand, which is appropriate in that the Grok language is used to help understand large-scale software. This term was coined in the book Stranger in a Strange Land, by Robert A. Heinlein.

This report consists of five parts. Part 1 introduces Grok’s set and relation operators using an example involving Garfield the cat. These operators are summarized in Part 2.

Part 3 shows how Grok can be used to interactively query factbases. It uses an example to introduce schemas (also called schemes) that characterize the form of factbases. The reader may want to consult the related report on TA (the Tuple Attribute languages) when reading this part. Part 4 shows how scripts written in Grok can be used to query and transform factbases. Finally, Part 5 summarizes the features of Grok.

Part 1. Introduction to Grok’s Operators

Part 2. Summary of Grok’s Operators

Part 3. Grok Queries, Factbases and Schemas

Part 4. Writing Grok Scripts

Part 5. Summary of Grok Features

Part 1. Introduction to Grok’s Operators

We will introduce Grok’s set and relation operators using an example in which there are two cats (Garfield and Fluffy), two mice (Mickey and Nancy) and two pieces of cheese (Roquefort and Swiss). We will begin by defining sets for these cats, mice and cheese.

Set constructors. In the Grok language we write

cat := {“Garfield”, “Fluffy”}

to construct the set variable named cat containing the strings Garfield and Fluffy. Similarly, we can create set variables mouse and cheese:

mouse := {“Mickey”, “Nancy”}

cheese := {“Roquefort”, “Swiss”}

Set union, intersection and subtraction. We can combine sets. For example, the Grok statement

animal := cat + mouse

creates the set called animal which is the union of the cat and mouse sets. As a result, animal contains the four strings Garfield, Fluffy, Mickey and Nancy. Similarly, we can take the union of the mouse and cheese sets to create the set of items that are food:

food := mouse + cheese

As a result, the food set will contain Mickey, Nancy, Swiss and Roquefort.

Intersecting animals with food, we can find the set of animals that are food:

animalsWhichAreFood := animals ^ food

These animals are Mickey and Nancy. We can use set subtraction to find those animals, which are not food:

animalsWhichAreNotFood := animals – food

These animals are Garfield and Fluffy.

Set cardinality. We can ask how many items are food:

howManyThingsAreFood := # food

This creates a new scalar (string) variable called howManyThingsAreFood, which is set to 4. This statement uses the cardinality operator # to return the size of the set

{“Mickey”, “Nancy”, “Swiss”, “Roquefort”}.

Command line statements. When Grok is used as a command line interpreter, you can directly enter statements, such as the example Grok statements we have given so far. You can type an expression or value, and the Grok interpreter will evaluate it and print it. For example, if you have already entered the above statements and you type

mouse

then Grok will print back to you

Mickey

Nancy

Sets are not ordered. Since ordering is immaterial in sets, we do not know if Grok will output Mickey before or after Nancy.

Script language. Grok is also a script language, so you can write a Grok program, consisting of Grok statements, to be executed by the Grok interpreter.

Set comparisons. We can compare sets, for example:

mouse <= food

This determines whether mouse is a subset of food. Both elements of mouse (Mickey and Nancy), are elements of food, so the Grok interpreter will output true.

The Grok set comparison operators are given in Figure 1.1.

==	equal, which can also be written as a single equal sign
~=	not equal
<=	subset
>=	superset
<	strict subset
>	strict superset

Figure 1.1. Grok Comparison Operators

Summary of set operators. This section has illustrated how sets are constructed using the notation {item1, item2, …}, how these sets can be combined using the union (+), intersection (^), and subtraction (-) operators, and how they can be compared, using <=, =, etc. As well, we introduced the cardinality operator (#).

Relations in Grok

We will now introduce Grok’s relations and relational operators. In Grok, a relation is a set of ordered pairs. For example, a relation might consist of the two pairs (Garfield, Fluffy) and (Mickey, Nancy). This relation might be the love relation, signifying that Garfield loves Fluffy and Mickey loves Nancy.

Cross product. We can use the cross product operator X (capital letter X) to create relations, for example:

chase := cat X mouse

This can be considered to mean that cats chase mice. As defined by this statement, the chase relation consists of these four pairs

Garfield Mickey

Garfield Nancy

Fluffy Mickey

Fluffy Nancy

In other words, both cats chase both mice. In general, the cross product of sets S and T produces every pair (x, y) such that x is in S and y is in T.

We can create the eat relation this way

eat := chase + mouse X cheese

This creates the eat relation to be the same as chase, but with the addition (union) of the cross product of mice and cheese. As a result, the eat relation will contain these eight pairs:

Garfield Mickey

Garfield Nancy

Fluffy Mickey

Fluffy Nancy

Mickey Swiss

Mickey Roquefort

Nancy Swiss

Nancy Roquefort

We can interpret this to mean that cats eat mice and mice eat cheese.

Operator precedence. Grok uses the traditional precedence of operators from mathematics, namely, additions (as well unions and subtractions) are done after multiplications (as well as intersections and cross products). For example, in the statement eat := chase + mouse X cheese, the cross product X is done before union (+).

Relation cardinality. We can use the cardinality operator # to ask how many pairs are in a relation:

howManyPairs := # chase

This will create the string variable howManyPairs and will assign it the value 4.

RSF. We sometimes represent the facts of a collected set of relations as triples of the form (R, x, y), where relation R contains the pair (x, y). For example, the chase and eat relations, taken together, can be represented this way:

chase Garfield Mickey

chase Garfield Nancy

chase Fluffy Mickey

chase Fluffy Nancy

eat Garfield Mickey

eat Garfield Nancy

eat Fluffy Mickey

eat Fluffy Nancy

eat Mickey Swiss

eat Mickey Roquefort

eat Nancy Swiss

eat Nancy Roquefort

This list notation is called RSF (Rigi Standard Format), because it was developed for use in the Rigi reverse engineering tool. We sometimes call a triple a fact, and we sometimes call these lists factbases. Ordering is immaterial in relations, so we can change the ordering of the lines in this list without changing the meaning of the list.

Set projectors. We can use the set projector operator, written as a dot, to determine how a set is related to another set. For example, we can determine what Mickey eats, by asking the question: How is the set {“Mickey”} related, via eat, to another set:

mickeyEatsWhat := {“Mickey”} . eat

This statement assigns mickeyEatsWhat the set containing Swiss and Roquefort. Similarly, we can ask what eats Mickey:

whatEatsMickey := eat . {“Mickey”}

This statement assigns whatEatsMickey the set containing Garfield and Fluffy. In the general case, when s is a set and R is a relation, the form s . R produces the set t such that if s contains x and R contains (x, y) then t contains y. The form R . s produces the set t such that if R contains (x, y) and s contains y then t contains x.

Domains and ranges. The dom (domain) of a relation is the set of all beginnings (left ends) of pairs in a relation. For example

eater := dom eat

This statement assigns eater the set of things that eat. This set consists of Garfield, Fluffy, Mickey and Nancy. The statement

food := rng eat

assigns food the set of things that are eaten. This set consists of Mickey, Nancy, Swiss and Roquefort.

The sink of a relation R is the set of items that are in R’s range but not in its domain. We can compute the sink for the eat relation as follows:

bottomOfFoodChain := rng eat – dom eat

This assigns bottomOfFoodChain the set containing Swiss and Roquefort. The source of a relation R is the set of items that are in the relation’s domain, but not in its range, as in:

topOfFoodChain := dom eat – rng eat

This assigns topOfFoodChain the set containing Garfield and Fluffy.

Relation union, intersection and subtraction. There are a number of Grok operators that combine relations. For example, we can intersect the eat and chase relations to tell us about both eating and chasing:

bothEatAndChase := eat ^ chase

This statement assigns bothEatAndChase this set of pairs:

Garfield Mickey

Garfield Nancy

Fluffy Mickey

Fluffy Nancy

This means that Garfield both chases and eats Mickey, that Garfield both eats and chases Nancy, etc. Relation intersection (^), returns the sets of pairs that are common to two relations. Analogously, relation union (+), returns all pairs that exist in either of the two relations.

Subtraction (-) is another relation operator, for example,

eatButNotChase := eat – chase

This assigns eatButNotChase those pairs in eat that are not in chase. As a result, eatButNotChase records that fact that mice eat but do not chase cheese.

Similarly, we can determine the relation for chasing but not eating:

chaseButNotEat := chase – eat

It turns out that this is the empty relation, because in our example, all chased things (the mice) are eaten (by the cats).

Empty relation. We could as well have written

chaseButNotEat := EMPTYREL

EMPTYREL is Grok notation for the relation containing no pairs. Another way to accomplish the same thing is to write the cross product of two empty sets:

ChaseButNotEat := { } X { }

Relation comparisons. It turns out that the relation bothEatAndChase and chase are identical, which we could check using comparison of relations, by writing:

bothEatAndChase == chase

This will return the value true. We can compare relations using the comparison operators that apply to sets, namely, ==, ~=, >=, <=. > and <. Each relation is actually a set of pairs, and these operators have the same meanings as they do for proper sets.

Relation composition and transitive closure. We will now introduce the composition operator (o). Relation R1 composed with relation R2 produces a new relation by a join in which each result pair is formed by attaching the right end of a pair in R1 to the left end of a pair in R2. Consider this example,

secondOrderEat := eat o eat

As a result, secondOrderEat will contain pairs such as (Garfield, Swiss) because Garfield eats something which eats Swiss. In general, if R1 contains (x, y) and R2 contains (y, z), then the result of R1 o R2 will contain (x, z)

The transitive closure of a relation R composes R with itself repeatedly, for example:

anyOrderEat := eat +

We can think of the result as “eats at any level of indirection”. The relation anyOrderEat will contain the pair that are in eat, such as (Garfield, Mickey) and (Mickey, Swiss), as well as second order edges such as (Garfield, Swiss). In this example, there are no higher level indirections of eating, due to the simple nature of our eat relation.

We can also compute the reflexive transitive closure of a relation R, written as R*, which is the same as transitive closure, with the addition of self loops, (x, x), on all items in the factbase.

Set id (identity) operator. A set of self loops, on a particular set of nodes, can be computed this way:

isMouse := id (mouse)

This assigns isMouse this set of edges:

Mickey Mickey

Nancy Nancy

In general, if set s contains x, then id s contains (x, x).

The id (identity) operator, and its self loops, are handy for restricting a relation to start or to end with a given set of items. For example, we can select the pairs in the eat relation that are eaten by mice as follows:

eatByMouse:= id (mouse) o eat

The resulting eatByMouse relation contains only the pairs corresponding to eating by mice. This composition (o) is computed by starting with the self loops in id (mouse). This guarantees that each result pair starts with a mouse. These self loops are “joined” with pairs from eat. The resulting pairs corresponding to eating by mice.

The identity of all nodes in the current Grok factbase is written as ID.

Relation inverse. Another relation operation is the inverse (or reverse) operator, for example:

chasedBy := inv chase

For each pair (x, y) in chase, this produces the pair (y, x) in chasedBy. So if Garfield chases Mickey then Mickey is chasedBy Garfield.

Part 2. Summary of Grok’s Operators

This section summarizes the Grok operators that were introduced using the example involving Garfield et al.

Union, intersection, subtraction, comparison and cardinality. Using the example of the cats, mice and cheese, we have introduced the basic set and relation operators of Grok. These include the following five operations given in Figure 2.1.

Operator	Form
Union	x + y
Intersection	x ^ y
Subtraction	x - y
Comparison	x = y, x <= y, …
Cardinality	# x

Figure 2.1. Operators used with sets and with relations

The two operands (x and y in above examples) of union, intersection, subtraction and comparison must both be sets or else both be relations.

Set and relation constructors. Sets can be created using a set constructor, for example:

{ x, y, z }

The empty set can be written as

{ }

or as

EMPTYSET

Relations can be created using cross products, for example:

{ a, b } X { x, y, z }

The result of the cross produce of sets s and t, s X t, contains all pairs (x, y) such that x is in s and y is in t.

A collection of relations can be represented in RSF format, i.e., as a list of triples. RSF format is used for files, but cannot be used in a Grok command.

The empty relation can be written as

EMPTYREL

or as

{} X {}

Domain and range. The domain (set of initial nodes) and range (set of final nodes) of relation R are written as:

dom R

rng R

Set projectors. A set s can be combined with relation R in a set projector using the form

s . R

R . s

The form s . R produces the set t such that if s contains x and R contains (x, y) then t contains y. The form R . s produces the set t such that if R contains (x, y) and s contains y then t contains x.

Identity relations. The identity relation on set s is written as:

id s

If s contains node x then id s contains the pair (x, x). The identity relation over all nodes in relations in Grok’s current factbase is written as:

Relation inverse. The inverse (or reverse) of relation R is written as
inv R

If R contains pair (x, y) then its inverse contains (y, x).

Relation composition. The composition of relations R1 and R2 is written as:

R1 o R2

If R1 contains the pair (x, y) and R2 contains the pair (y, z), then their composition is a relation that contains the pair (x, z). (Composition is much like a “join” in a conventional relational database.)

Transitive closure. The (non-reflexive) transitive closure of relation R is written as:

R +

The transitive closure consists the union of R composed with itself any number of times:

R + = R + (R o R) + (R o R o R) + …

The symmetric transitive closure is written as

R *

The value of R* is the same as R+ with the addition of self loops (ID):

R * = ID + R+

Part 3. Grok Queries, Factbases and Schemas

This section shows how the Grok interpreter can be used as an interactive query engine. Before doing this, we will introduce an example C program and will show how facts about that program can be represented as a factbase. We will then give example Grok queries that probe these facts.

The EG Example: A Very Simple C Program

We will use a very simple C program, called EG, as an example. Our C program consists of three files: main.c, putint.c and putint.h. The first, main.c, sets a global variable to 5 and calls the putint function with a parameter of 10. The putint.h file declares a global variable to be an integer (as external) and gives the header of the putint function (as external). The putint.c program gives the body of the putint function, which computes the product of its parameter and the global variable and prints the result. These files are listed in Figure 5.1.

File main.c:

      #include "putint.h"

      void main(void){

          global_var=5;

          putint(10);

File putint.h:

      #ifndef _PUTINT_H_

      #define _PUTINT_H_

      extern int global_var;

      extern int putint(int param);

      #endif

File putint.c:

      #include "putint.h"

      int global_var;

      int putint(int param) {

          int local_var;

          local_var=global_var*param;

          printf("The number is:%d\n",local_var);

Figure 5.1. EG: A Very Simple C Program

Facts Extracted from EG

The PBS project, at the Universities of Toronto and Waterloo, developed a special purpose version of the GCC compiler front end called CFX (C Fact Extractor), which translates C programs into graphs. In the case of the EG program the translation produces the graph as the stream of triples shown in Figure 5.2. (To be technically correct, we should say that CFX produces tables, which are translated into the stream by a program called fbgen. To keep things simple, we will consider that CFX directly produces the stream.) CFX stores this output stream as RSF triples in a file called factbase.rsf.

funcdef main.c main

defloc main main.c:3

include main.c putint.h

linkcall main putint

linkref main global_var

macrodef *Initialization* __GCC_NEW_VARARGS__

macrodef *Initialization* __GNUC_MINOR__

macrodef *Initialization* __GNUC__

macrodef *Initialization* __sparc

macrodef *Initialization* __sparc__

macrodef *Initialization* __sun

macrodef *Initialization* __sun__

macrodef *Initialization* __unix

macrodef *Initialization* __unix__

macrodef *Initialization* sparc

macrodef *Initialization* sun

macrodef *Initialization* unix

funcdcl putint.h putint

dclloc putint putint.h:5

vardcl putint.h global_var

vardclloc global_var putint.h:4

macrodef putint.h PUTINT_H

funcdef putint.c putint

defloc putint putint.c:4

vardef putint.c global_var

vardefloc global_var putint.c:3

include putint.c putint.h

include putint.c putint.h

librarycall putint printf

Figure 5.2. The EG Program, as Translated by CFX to RSF Triples

As a reader, you do not need to concern yourself with the details of this stream. Our purpose in giving this stream is to illustrate the kinds of streams that Grok deals with. In this stream, the triple

funcdcl F p

means file F contains a function declaration (function header) for p. The triple

funcdef F p

means file F contains a function definition (function body) for p.

CFX Schema

Figure 5.3 gives the list of the relations produced by CFX, in the form of a TA schema, with brief explanations of each relation.  (Aside: in the last part of the schema, the string file:lineNum has been placed in quotes.  This is done because the Grok interpreter puts quotes around strings containing colons.)

SCHEME TUPLE :

funcdef file function // function is defined in file

funcdcl file function // function declared in file

funcdcl file libraryfunction // libraryfunction declared in file

vardef file variable // var defined in file

vardcl file variable // var declared in file

macrodef file macro // macro defined in file

typedef file type // type defined in file

structdef file struct // struct defined in file

uniondef file union // union defined in file

enumdef file enum // enum defined in file

include file file // file 1 includes file 2

sourcecall function function // function 1 calls function 2

// ( function 2 definition is

// in same compilation unit

// OR

// is included in same compilation

// unit as function 1 )

sourceref function variable // function references variable

// ( variable definition is

// in same compilation unit

// OR

// is included in same compilation

// unit as variable )

linkcall function function // function 1 calls function 2

// ( function 2's definition

// must be linked to )

linkref function variable // function references variable

// ( variable definition must

// be linked to )

usemacro file macro // file uses macro

usetype file type // file uses type

usestruct file struct // file uses struct

useunion file union // file uses union

useenum file enum // file uses enum

librarycall function libraryfunction // function calls libraryfunction

// ( a libraryfunction is a func

// whose definition does not

// exist in the analysed code )

// Attribute tuples extracted by CFX

defloc function “file:lineNum” // func definition location

dclloc function “file:lineNum” // func declaration location

deflocend function “file:lineNum” // func definition end location

vardefloc variable “file:lineNum” // var definition location

vardclloc variable “file:lineNum” // var declaration location

macrodefloc macro “file:lineNum” // macro definition location

typedefloc type “file:lineNum” // type definition location

enumdefloc enum “file:lineNum” // enum definition location

structdefloc struct “file:lineNum” // struct definition location

uniondefloc union “file:lineNum” // union definition location

enumdeflocend enum “file:lineNum” // enum definition end location

structdeflocend struct “file:lineNum” // struct definition end location

uniondeflocend union “file:lineNum” // union definition end location

Figure 5.3. The CFX Schema (Also Called the PBS Low Level Schema)

The notation for the schema given in Figure 5.3 has, almost, the form of a TA scheme (schema).  However, the form of this schema is a bit dated (CFX and this schema were created before TA was defined), in that the attributes are written as relations.  For example, this line in the schema

    defloc function     “file:lineNum”

would be expected to be written in TA as

      function { defloc = “file:lineNum” }

However, we can treat attributes as relations with no harm done, so we will not be concerned with this difference.

CFX Schema as an Entity/Relation Diagram

We can draw a schema as a diagram.  For example, the CFX schema is shown as a diagram in Figure 5.4.  This diagram has the form (almost) of an Entity/Relation Diagram; E/R Diagrams are commonly used in the design of relational databases.

Figure 5.4. The CFX Schema as a Diagram

Querying EG’s CFX Factbase

We can use the stream shown in Figure 5.2 as the basis for answering questions about the EG program. In particular, we can use Grok to ask queries about the EG program. To do this, we can start up the Grok interpreter (the command for Grok is grok or grok.x). Our first command to Grok is:

getdb “factbase.rsf”

This reads the stream into Grok’s internal memory. (Aside: In interactive mode, Grok is forgiving of syntactic details, so when we are typing this interactively, we can omit the quotation marks and type simply getdb factbase.rsf.)

We can ask Grok to list the entire database by typing

listdb

When we do this, Grok produces a list of triples much like that in Figure 5.2. However, its ordering of triples may be different, because the order of relations is immaterial to Grok.

We can determine the number of triples by typing

dbsize

Grok will output the number 29, which is the count of the triples in the list.

For unknown reasons, the CFX extractor sometimes produces repeated copies of triples. In Figure 5.2, the factbase.rsf stream contains two copies of the line

     include putint.c putint.h

which appear near the bottom of the list. We can delete duplicate lines by typing

deldups

After we have typed deldups, when we again type dbsize, Grok will tell us that there are only 28 triples, because the one of the duplicated copies was deleted.

Querying about File Inclusions

Perhaps we want to check to see what includes were in the EG program. To do this, we can type

include

This asks Grok to list all the pairs in the include relation, so it outputs:

main.c putint.h

putint.c putint.h

These two pairs tell us that both main.c and putint.c include putint.h.

We can ask the question: What files does main.c include, by typing:

main.c . include

This form is a forward set projector. As a result, Grok will type

putint.h

In this example, we have cut down on the amount of we needed to type by taking advantage of Grok’s forgiving nature. Grok’s set projectors actually expect to receive a set value. To be perfectly correct, and to avoid Grok error messages, we should really have written:

{ “main.c” } . include

But when we are working interactively, we often use these shortcuts.

We can determine what files include putint.h by typing:

include . putint.h

This form is a backward set projector, and again we are minimizing the amount we need to type. As a result, Grok will type

main.c

putint.c

Querying about Function Calls

We might want to know more about calls in the EG program. We could start out by asking the names of all relations, by typing

relnames

As a result, Grok will output the list of names of relations:

funcdef

defloc

include

linkcall

linkref

macrodef

funcdcl

dclloc

vardcl

vardclloc

vardef

vardefloc

librarycall

From this list, we can surmise that the fact base contains two kinds of calls: linkcall and librarycall. From the schema (Figure 5.3) we know that a linkcall is a call that is linked from one user compilation to another, while a librarycall is a call to a library function. To inspect all linkcalls, we type

linkcall

As a result Grok outputs

main putint

To find out about library calls, we type

librarycall

and Grok outputs

putint printf

We have determined that the only linked function call is from main to putint, and the only call to a library is from putint to printf.

We may want to know what things putint is related to. To find out, we type

putint :

It is helpful to think of the colon operator (:) as two eyeballs, looking forward from putint to see what it is related to. As a result, Grok will output:

dclloc putint.h:5

defloc putint.c:4

librarycall printf

From looking at the comments in the CFX schema, we can surmise from this that putint is declared on line 5 of putint.h and defined on line 4 of putint.c. We can also re-affirm that putint calls the library function printf.

We might want to know what depends on the putint function. To find out we type

: putint

It is helpful to think of this as two eyeballs looking at putint. As a result, Grok will output:

main linkcall

putint.h funcdcl

putint.c funcdef

This means that function main calls putint. It also means that putint is declared in putint.h and is defined in putint.c.

The C488 Compiler Example

As our next example, we will consider a larger program. This program is a compiler, written in C, for a Pascal-like language. It has a traditional compiler structure, with a scanner, a parser, a semantics pass and a code generator. We have run C488 through CFX and have obtained the factbase for it. This factbase, factbase.rsf, for C488 is much larger than the factbase for EG. Further, we have run this factbase though the PBS pipeline and as a result, we have created a landscape (structural diagram) for it. You may want to view it on the web by this URL:

http://swag.uwaterloo.ca/pbs/examples/C488/V1/index.html

If you navigate around through this landscape, you will discover that the C488 compiler consists of a number of files, such as parser.c, parser.h, codgen.c, codegen.h, globalvars.h, tokendef.h, etc. You can observe that the only relations shown in the diagram are call, reference, implementby and otherdep. As you may recall, the CFX schema has many more relations than this. Why does the diagram have fewer and different relations? The answer is that a number of Grok scripts manipulated the factbase, combining and renaming relations, to make the diagrams easier to comprehend. Another thing to notice is that the diagram contains higher level entities, such as the semantics subsystem, which group together a number of files. In particular, the semantics subsystem groups these files: semantics.h, semantics.c and semanticext.h.

When landscape diagrams are inspected, it is common to observe surprising dependencies. For example, in the C488 top view, there is a function call from the codegen subsystem to the semantics subsystem. Why should the code generator call the semantic analysis part? It seems that semantic analysis should be completed before code generation starts, so such a call would be wrong or maybe just peculiar. We want to know what actual function call(s) are causing this high level interaction from the code generator to semantics.

Since this kind of question happens repeatedly, we will give it a name: the Where-Did-This-Arrow-Come-From problem. Ideally, we would just press a button on the landscape diagram display, and we would get the answer. The reason there is no button is that there has been considerable abstraction and simplification from the source program to the CFX factbase to the stream used as a basis for drawing the landscape diagrams. However, by using the factbase and Grok, we can work our way backward to answer this question.

The Where-Did-This-Arrow-Come-From Problem

We will consider the following question.

Q1: Where did this arrow from subsystem S to subsystem T come from?

As an example, we will consider the call arrow from S, the code generator subsystem, to T, the semantics subsystem. We will answer this question in two steps.

Step 1. Using the landscape diagram, we will partly answer the question by finding out what files in system S call what files in subsystem T. Given that we have discovered that file F in S calls file G in T, we are now ready for the easier question:

Q2: Where did the arrow from file F to file G come from?

Step 2. We will answer question Q2 by using Grok to query the program’s CFX factbase.

Aside: This example uses a factbase created by the CFX fact extractor. If we were using a different fact extractor, such as CPPX, we would need to modify our approach, but the same general method would apply.

In Step 1, we can use the diagrammatic querying facilities of the landscape drawing program to file out what file(s) in the code generator call what file(s) in semantic analysis. By concentrating first on the codegen subsystem, we can look into its internal diagram to determine that the only file in it that makes calls is codegen.c. We can conclude that F (the file in S) is codegen.c. Then we can look inside the semantics subsystem, where we can immediately discover that the only file that is called is semantics.c. So we can conclude that G (the file in T) is semantics.c. In a more complex example, we would find more than one file F that calls more than one file G. Also, in a more complex example, we may have to spend some time diving deeper into the landscape diagrams to locate the files that are causing the interaction. At any rate, once files F and G are isolated, using the diagrams, we are ready to go to step 2.

To carry out Step 2, we use the program’s factbase.rsf file, which was created by CFX. This factbase contains interactions at the level of functions and global variables. After starting up Grok, we type this command to load the factbase:

getdb factbase.rsf

Once this is loaded, it is a good idea to check its size (using dbsize) and to take a look at the names of its relations (using relnames), just to gain confidence that what we are inspecting makes sense. It is also important to understand the schema reasonably well; it is recommended that you refer to the CFX schema (Figures 5.3 and 5.4). From the schema, we can tell that the funcdef relation records when file F defines function P. We can find the names of all functions declared in file codegen.c by typing:

P := codegen.c . funcdef

To check that F seems reasonable, we can find out how many functions are in it, by typing:

# P

When we do this, Grok tells us that P contains 21 items. Since P is not large, it is a good idea to list these items, by typing:

As a result, Grok outputs all the functions defined in codegen.c:

BBS_Pop

BBS_Push

Code

Code31

Code32

Code46

Code47

FBS_Pop

FBS_Push

FBS_Swap

LES_Pop

LES_Push

Pop_FuncName

Pop_arrayName

Push_FuncName

Push_arrayName

Return_Func_NumArgs

allocate_storage

codegenFinalize

codegenInitialize

generateCode

This list seems reasonable, i.e., it is not surprising that these functions are part of codegen.c.

We will next find all functions in semantics.c. In querying to find these functions, we will ask for all functions that are either declared or defined in semantics.c, by typing:

Q := semantics.c . (funcdef + funcdcl)

We could have written simply

Q := semantics.c . funcdef

However, there is the possibility that semantics.c might contain a function header, without a corresponding body, and that the arrow we are trying to locate might be pointing to this header. So we are using the first (the longer) query. Once we have the value of Q, it is a good idea to see how big it is and to list the names of functions contained in it. We have illustrated these operations for P, so we won’t show them for Q.

Now we are ready to give a query that asks for all linkcalls from functions in codegen.c (in P) to functions in semantics.c (in Q). Here is the query:

id (P) o linkcall o id (Q)

This query makes use of the id function twice. The first id produces self loops on each member of P. These self loops are composed with linkcall so that the result includes only those pairs in linkcall which start with a function in P. These pairs are next composed with id (Q) so that the resulting relation contains only those pairs that end with a function in Q. The result contains all those link calls from a function in codegen.c to a function in semantics.c. As a result of this query, Grok outputs

generateCode ErrorHandler

From this we conclude that there is only one function, generateCode, in codegen.c, which calls a function in semantics.c. The only such called function is ErrorHandler.

To find out more about the generateCode function, we can type

generateCode :

As a result, Grok outputs:

defloc codegen.c:365

deflocend codegen.c:869

linkref errorFile

linkref scanInteger

linkref traceCodeGen

linkref traceFile

linkcall ErrorHandler

linkref PrevScope

linkref ScopeLevel

librarycall fprintf

librarycall strlen

dclloc codegen.h:21

From this list, we can see that generateCode is declared (defloc) on line 21 of codegen.h and is defined (defloc) on line 365 of codegen.c We could make a similar query for ErrorHandler, but instead we will specialize the query to give us the locations of ErrorHandler’s declaration and definition, by typing:

ErrorHandler . (dclloc + defloc)

As a result of this query, Grok outputs:

"semantics.c:138"

This output is surprising, in that it indicates that the Error Handler does not have both a definition and a declaration. To find out more, we can limit our query to try to find only the declaration, by typing:

ErrorHandler . defloc

Intriguingly, Grok produces no output for this query, indicating that ErrorHandler has no declaration. We can conclude that ErrorHandler has a definition (we could write a query to explicitly check this), but no declaration.

What we have learned is that ErrorHandler is programmed with a peculiar style, with a definition but no declaration. While this usage is allowed by the C language, it a frowned upon as being bad style by many programmers. This is particularly suspicious, because we have also discovered that this function is part of semantics.c, but is called from codegen.c. To see how widely ErrorHandler is used, we type:

linkcall . ErrorHandler

As a result Grok outputs only

generateCode

This tells us that ErrorHandler is called (outside of semantics.c, externally via linking) by only the generateCode function. This is further evidence that ErrorHandler is used in a peculiar fashion, and perhaps should be moved to a new location, or perhaps that generateCode should be calling a function other than ErrorHandler.

We will not explore this peculiarity further here. Our goal was to show how to solve the Where-Did-This-Arrow-Come-From Problem. To keep the explanation simple, we have used a straightforward example. In more complex cases, it is easy to get lost in the details. Keeping track of the queries, and avoiding mistakes when typing the queries, is tedious and can be error prone. At the same time, it can also be rewarding, leading to discoveries of unexpected and sometimes incorrect parts of the program. Indeed, it is intellectually interesting to be able to navigate through the structure of a piece of software.

Credits. Gerry Kovan, with assistance from Ric Holt, created the CFX TA schema in Figure 5.3. Ivan Bowman created the E/R Diagram for this schema, in Figure 5.4.

Part 4.

Writing Grok Scripts

The preceding section showed how the Grok interpreter can execute user command interactively. We showed how this facility can be used to determine what functions in file codegen.c call what functions in file semantics.c.

While it is handy to execute Grok commands interpretively, it is also handy to bundle up a set of commands and run them as a script. Figure 6.1 gives a script that solves this same Where-Did-This-Arrow-Come-From Problem.

1 % Find call edges from P to Q

2 getdb "factbase.rsf"

3 P := {"codegen.c"} . funcdef

4 countP := # P

5 put "codegen.c contains ", countP, " function defs"

6 Q := {"semantics.c"} . (funcdef + funcdcl)

7 countQ := # Q

8 put "semantics.c contains ", countQ, " function defs and dcls"

9 put "Here are the linkcalls from codegen.c to semantics.c"

10 id (P) o linkcall o id (Q)

Figure 6.1. Grok Script comefrom.grk

The Grok script in Figure 6.1 contains roughly the same commands that were given in the previous section as interactive command. (The line numbers in the figure are only for reference purposes, and so not actually appear in the script.) In the figure, we have inserted a comment in line 1; as you can see, comments start with a percent % character.

In line 2, we added quotation remarks around factbase.rsf. In lines 3 and 5, we added quotation marks as well as set brackets { … } around the file names. These additions make the commands syntactically correct. In the interactive commands, we avoided some typing by taking syntactic shortcuts, which are, officially speaking, illegal, but since the interpreter is forgiving, we saved ourselves a few keystrokes.

Aside: While it is convenient to take shortcuts when typing interactive commands, you should avoid these and eliminate syntax errors in a Grok script. Grok scripts are hard enough to write without getting confused by error messages. Besides, error messages often help isolate serious bugs in the script.

Lines 5, 8 and 9 illustrate the use of the put statement. It accepts sequences of items separated by commas. These items can be strings, as well as names of variables.

We can run this script from the command line by typing

grok comefrom.grk

When we do this, we will get this output:

codegen.c contains 21 file defs

semantics.c contains 17 file defs and dcls

Here are the linkcalls from codegen.c to semantics.c

generateCode ErrorHandler

There is another way to run this script, namely, we can give an interactive command to invoke it. To do this, we can first start up the Grok interpreter by typing

grok

Once Grok starts up, we can give this command:

exec comefrom.grk

This runs the script and produces the same output as has just been shown. When the script finishes running, Grok will still in interactive mode, so we can continue giving it commands. For example, we could type this simple command:

This command lists the items contained in set P (these are the functions defined in codegen.c). More generally, using the exec command in interactive mode is helpful because it allows us to inspect the values of relations and variables that have been computed by the script. The exec command can also be used in a script, to all another script as a subroutine. For example, the command exec comefrom.grk could appear in a script.

Using Arguments in a Script

Our script solves one particular Where-Did-This-Arrow-Come-From Problem. However, it works only for two particular files, codegen.c and semantics.c and only for a factbase called factbase.rsf. We can generalize it to handle any files and any factbase by using arguments, as is illustrated in Figure 6.2.

1 % Find call edges from P to Q

2 if $n ~= 3 then % Are there 3 arguments?

3 put "Usage: grok comefrom.grk factbaseRsf fromFileName toFileName"

4 put "Finds linkcalls from fromFileName to toFileNamec "

5 put "Author: Ric Holt, Date: 22 Apr 2002"

6 quit

7 end if

9 factbaseRsf := $1

10 fromFileName := $2

11 toFileName := $3

13 getdb factbaseRsf

14 P := { fromFileName } . funcdef

15 countP := # P

16 put fromFileName, " contains ", countP, " function defs"

17 Q := { toFileName } . (funcdef + funcdcl)

18 countQ := # Q

19 put fromFileName, " contains ", countQ, " function defs and dcls"

20 put "Here are the linkcalls from ", fromFileName, " to ", toFileName

21 id (P) o linkcall o id (Q)

Figure 6.2. Grok Script comefrom.grk Rewritten with Arguments

In the Grok script, the notation $n (line 2) gives the number of arguments, and $1, $2, $3, etc. (lines 9 –11) give the values of arguments 1, 2, 3, etc. Grok’s form of if statement is illustrated by lines 2 to 7. The quit statement (line 6) aborts the script.

The script can be run by giving the command

grok comefrom.grk factbase.rsf codegen.c semantics.c

or in interactive mode by typing

exec comefrom.grk factbase.rsf codegen.c semantics.c

Other file names and another name of a factbase can be given as arguments.

If a script is used often, it is good to create a shell script that invokes it, in order to simplify use of the script. For example, we might create a shell script named comefrom that consists of this line:

grok comefrom.grk $@

The form $@ matches all the arguments passed to Grok. This shell script allows us to use comefrom by just typing comefrom factbase.rsf codegen.c semantics.c. For this to be generally useful, the shell script should provide an absolute path (or a path determined by a shell variable) so that comefrom.grk can be found from any directory.

Invoking a Script from a Library

This exec command can be used in another Grok script. When this is done, care must be taken to be sure that the script, comefrom.grk in our example, can be located by Grok. By default, Grok will expect to find the script in the local directory. The SwagKit project, at the University of Waterloo, has a library of commonly used Grok scripts. These are stored in a directory, which is located by a shell variable called SWAGKIT_PATH. Figure 6.3 gives an example of using this shell variable.

execPath := getenv "SWAGKIT_PATH"

linkProcedure := execPath cat "/lib/pipeline/" cat "lift.grp"

exec linkProcedure

Figure 6.3. Using a Shell Variable to Locate a Grok Script

In the figure, the Grok built-in function getenv fetches the value of the SWAGKIT_PATH shell variable. This is assigned to string variable execPath. The complete path to the Grok script is built up by catenating execPath, /lib/pipeline/ and lift.grp. If SWAGKIT_PATH has the value /home/holt/dev_swagkit (as it currently does for the author of this note), the complete path for the Grok script will be

/home/holt/dev_swagkit/lib/pipeline/lift.grp

(Note: The suffix .grp is used by convention to label a Grok procedure.) If you build up a library of Grok scripts, you can use a shell variable to locate your scripts, as is done for SwagKit.

Abstractions by Grok Scripts

A common use of Grok scripts is to read a factbase and to simplify or abstract it. One way of doing this is to combine sets of similar relations into a single relation. Figure 6.4 gives a script, named cfxadddep.grk, which combines all dependencies in a CFX factbase into a single relation called dep. The factbase, with dep added to it, is written out to a new factbase.

1 % Add dependency 'dep' to CFX fact base.

2 if $n ~= 2 then % Are there 2 arguments?

3 put "Usage: grok cfxadddep.grk factbaseRsf newFactBaseRsf"

4 put "Combines all dependencies into a single dependency 'dep'"

5 put "Author: Ric Holt, Date: 22 Apr 2002"

6 quit

7 end if

9 factbaseRsf := $1

10 newFactbaseRsf := $2

12 cfxDep := { "include", "sourcecall", "sourceref", "linkcall", \

13 "linkref", "usemacro", "usetype", "usestruct", "useunion", \

14 "useenum", "librarycall" }

16 for d in cfxDep

17 $ d := EMPTYREL

18 end for

20 getdb factbaseRsf

22 dep := EMPTYREL

23 for d in cfxDep

24 dep := dep + $ d

25 end for

27 putdb newFactbaseRsf

Figure 6.4. Combining Dependency Relations

Lines 12-14 of the script create the set cfxDep, which contains the names of all dependencies in the CFX schema (see Figure 5.3 and 5.4). Note that a backslash \ has been used at the ends of lines 12 and 13 to extend the command to multiple lines.

Lines 16-18 and 23-25 are examples of Grok for statements. These iterate across elements of a set of strings. In each of these, the loop is executes with d assigned each of the values contained in set cfxDep. We know that the iterations will assign d each value from the set, but we are not guaranteed what the ordering will be.

Line 17 contains the statement

$ d := EMPTYREL

The dollar sign $ means that the value of d is to be treated as a relation or variable. For example, when d has the value sourcecall, this statement will effectively be

sourcecall := EMPTYREL

The reason that we have assigned the value EMPTYREL is a bit subtle. This assignment acts like a “declaration” of the relation. This “declaration” needs to precede the statement getdb factbaseRsf, so that, if the factbase does not contain any pairs in the relation, then the relation still exists (is declared), even though it is empty. Without this “declaration”, if there were no pairs in the relation, the relation would not exist, as far as the Grok interpreter is concerned. This is in turn would cause problems in line 24, in which the dep relation is constructed from all the dependency relations. If d has a value (such as sourcecall) in line 24 which isn’t the name of a relation, the Grok interpreter will not know how to execute the line. To avoid this difficulty, any relation being read from an RSF or TA stream, which might not have any pairs in the stream, should be assigned EMPTYREL before reading the stream.

Combining dependencies into “higher level” dependencies is a common way to abstract information in a graph. Another way is to eliminate low-level entities, such as statements, or even functions within files. When this is done, the dependencies from the low level entities are usually lifted (also called raised) to the level of their parent nodes. Grok scripts to do this lifting are part of the PBS and SwagKit pipelines that prepare graphs for visualization.

Other Grok Features

We have introduced key features of Grok by means of examples. We will finish by mentioning a couple more features.

It is common to need to delete some relations or some sets of nodes from a factbase. For example, in the script in Figure 6.4, after adding the dep relation we might want to delete all the relations, which were combined into dep. This can be done as follows

for d in cfxDep

delrel $ d

end for

This loop uses the delrel command to delete relations from the factbase. To remove a set of nodes from the factbase, whose names are in a set nodesToDelete, we can use this statement

delset nodesToDelete

This command removes all triples in the factbase that reference nodes in nodesToDelete.

When debugging a Grok script, the following command can be very helpful:

option echo

This tells the Grok interpreter to trace the script, outputting each command as it is executed. The command

option timing

outputs the time taken to execute each Grok command.

We will now give a final observation about debugging Grok scripts. Grok does not have a compiler, so there is no direct way to see if your script is syntactically legal. However, the following trick works reasonably well: run the script with dummy values for the arguments. This makes Grok run the script (with some bogus messages about bad arguments), but more importantly, gives messages about other syntax errors. You should fix such errors before testing your script on real data.

There are a number of other commands, which are summarized in the following section. This concludes our introduction to Grok.

Part 5.

Summary of Grok Features

1. Grok input/output commands

Grok input commands:

getdb F Read RSF factbase from file F

adddb F Add new tuples to data base from file F

getta F Read TA factbase from file F

addta F Add more TA to date base from file F

getset s [F [D]] Read set s [from file F [in dir D]]

Grok output commands:

putdb F Write RSF data base to file F

relToFile R F Write rel R to file F

appendRelToFile R F Append rel R to file F

putta F Write TA factbase to file F

putset s [F [D]] Write set s [to file F [in dir D]]

put msg {, msg} Print msgs; each is a string or a variable name

The file format for relations for get/put is ASCII triples, eg,

R a b where R relates a to b (this format is called 'RSF').

The file format for sets is blank-separated strings.

Beware that the read/write (compressed) format isn't portable.

When TA is read in, attributes are treated as relations (with @_

added as a prefix). The relations and attributes at the scheme level

have $_ prepended to their names.

2. Grok relation operators:

R1 + R2 Union of relations

R1 - R2 Difference of relations

R1 ^ R2 Intersection of relations

R1 o R2 Relational composition [was previously '*']

inv R Inverse of R

s . R Project set s through rel R [was *]

R . s Project set s through rel R backwards [was *]

R+ Transitive closure of R

R* Reflective transitive closure of R

R1 cmp R2 Compare: cmp is == [or =], ~=, <, <=, >, >=

EMPTYREL The empty relation

# R Cardinality (number of tuples) of R

dom R Domain of R (a set)

rng R Range of R (a set)

ent R Range + domain of R (all entities in R)

id s Identity relation on set s

DOM Domain of all relations in factbase (a set)

RNG Range of all relations in factbase (a set)

ENT Range + domain of all relations (a set)

ID The identity relation (based on ENT)

head R Head of R, eg (x y) in R produces ((x R y) head y)

tail R Tail of R, eg (x y) in R produces (x tail (x R y))

rel2tripleset R Make R into a set, eg (x y) produces (x R y)

tripl2relname Given '(x R y)' produce 'R'

3. Other relation commands:

relnames Set of the names of relations

delrel R Remove R's tuples factbase

listdb Print triples in factbase

sample R [b [n] ] Print sample from R. Begin with

number b (default 1) & print n (10)

slice s Remove tuples that reference other than s

wslice s Remove tuples that do not reference s

delset s Remove tuples from factbase that reference s

delsetset s t Remove tuples with source from s and target from t

deldups Delete any duplicate tuples in factbase

dbsize Number of tuples in factbase

cleardb Delete all relations

4. Grok set operators:

s1 + s2 Union of sets

s1 - s2 Difference of sets

s1 ^ s2 Intersection of sets

s1 cmp s2 Compare: cmp is == [or =], ~=, <, <=, >, >=

# s Cardinality (number of elements) of s

s1 X s2 Cross product s1 X s2 (a relation)

{ e1, e2, ... } Construct set containing e1, e2, ...

EMPTYSET The empty set. You can use { } instead

pick s Pick a value (a string) from s

prefix Q s Set of items in s with prefix Q

suffix s Q Set of items in s with suffix Q

5. Other set commands:

setnames Print the names of set variables

sample s [b [n] ] Print sample from s. Begin with

number b (default 1) & print n (10)

clearsets Delete all sets

addprefix Q s Add prefix Q to items in set s

addsuffix s Q Add suffix Q to items in set s

addprefixrel Q s Add prefix Q to rel names in set s

addsuffixrel s Q Add suffix Q to rel names in set s

addprefixdb Q s Add prefix Q to src & trg of rel names in set s

addprefixsrc Q s Add prefix Q to src of rel names in set s

addsuffixsrc s Q Add suffix Q to src of rel names in set s

addprefixtrg Q s Add prefix Q to trg of rel names in set s

addsuffixtrg s Q Add suffix Q to trg of rel names in set s

delprefix Q s Delete prefix Q to items in set s

delsuffix s Q Delete suffix Q to items in set s

delprefixrel Q s Delete prefix Q to rel names in set s

delsuffixrel s Q Delete suffix Q to rel names in set s

delprefixsrc Q s Delete prefix Q to src of rel names in set s

delsuffixsrc s Q Delete suffix Q to src of rel names in set s

delprefixtrg Q s Delete prefix Q to trg of rel names in set s

delsuffixtrg s Q Delete suffix Q to trg of rel names in set s

duplicate s Creates new set with elements e.dup for

each element e of s, with same

connecting relations as in s

6. Grok number operators:

n1 + n2 Addition

n1 - n2 Subtraction

n1 * n2 Multiplication

n1 / n2 Real division

n1 div n2 Truncating division

n1 mod n2 Modulo

n1 cmp n2 Compare: cmp is == [or =], ~=, <, <=, >, >=

7. Grok string operators

S1 cat S2 Catenate strings S1 and S2

S1 cmp S2 Compare: cmp is == [or =], ~=, <, <=, >, >=

s : Out edges from node s

: s In edges to node s

8. Grok boolean operators:

b1 and b2 Conjunction

b1 or b2 Disjunction

not b Boolean negation

9. Levels of precedence are, from low to high (most binding):

and

cmp (==, ~=, <, <=, >, >=)

+ - cat (additions)

* X

prefixOperators (inv dom rng ent - not ~ # pick $)

suffixOperators (*, +, ie, transitive closure)

Parentheses (...) can be used to specify order of operations

10. Statements:

x := expn Assign expn to variable x (an identifier)

$ x := expn Assign expn to variable whose name is in x

(The 'x = expn' form is considered to be obsolete.)

if expn then Cascaded 'if' statement, with optional

statements 'elsif' and 'else' clauses

{elsif expn then

statements}

[else

statements]

end if

loop Loop statement, stopped by 'exit'

statements

end loop

for e in s Repeat with each element e from set s

statements (exits are supported)

end for

exit Exit from loop/for

exit when expn Exit from loop/for when expn is true

return Return from current Grok script

quit Quit (halt Grok)

11. Other Grok commands:

varnames List of all relation, set and string variables

$ str String 'str' is taken as a variable's name

If value is integer i, evaluates to arg i

$i Value of parameter i, i is 0, 1, 2 ...

$n Number of parameters

getenv s Get environment value s

sysexec s Execute shell command s. Value is return code

expn The value of expn (expression) is printed

exec F p1 ... Exec file F (Grok script) with parameters p1 ...

where t is each member of s

randdb t r n Generate t tuples with r relations and

n nodes.

? Print the help menu

reset Restart (clear relations and variables)

Use '\' in last column to extend commands across line boundaries

12. Setting Grok options:

option Q Set option Q

The options that can be set include:

timing Print time to complete each operation

echo Echo each command (useful for tracing scripts)

debug Interpret 'debug comand' as 'command' (default off)

verbal Print out extra info during commands

progress Show progress on large operations (verbose)

optimize Run fast (default)

dumpsets Dump all sets and set variables (sets.tu)

dumpnames Dump all stored names (storage.tu)

The first of these are turned off using 'no', as in 'option notiming'.

13. Limits on Sizes in Grok

Many/most data sizes are dynamically increased up to the maximum available memory. The dynamically adjusted sizes are for:

Triples

Names

(Along with internal tables using for sorting, etc.)

Sizes which are statically fixed are:

Maximum length of names; strSize=255 bytes, can increase to 255 at most

Maximum tokens on a source Grok line; maxTokens (can be increased)

Introduction to the Grok Language

Table of Contents

Part 1.

Introduction to Grok’s Operators

Relations in Grok

Part 2.

Summary of Grok’s Operators

Part 3.

Grok Queries, Factbases and Schemas

The EG Example: A Very Simple C Program

Querying about File Inclusions

Querying about Function Calls

The C488 Compiler Example

Using Arguments in a Script

Abstractions by Grok Scripts

Other Grok Features