		       TAXFORM -- TA eXchange FORMAT
		       =============================

		    acacia2pbs Fact Conversion Scripts
		    ==================================


These scripts will convert output from the Acacia extractor into
factbase.rsf, as expected by the PBS system.  They assume you have already
managed to extract the facts, ma'am, using CCia.


Nutshell for Experts
====================

The file you want to call is acacia2pbs.sh.  This will generate
acacia-factbase.rsf.  Be sure you are in the source dir when you call it,
and that you have already generated an Acacia database for the system.


Requirements to Run these Scripts
=================================

-- Acacia version 1.9 or later (including ksh93, required to run 
   the ksh scripts)
-- nawk or gawk
-- grok version 30.6 or later ("new grok")
-- [optional] PBS version 4.0 or later.  Note that this includes an earlier
   version of grok that won't work for us.

Detail
======

acacia2pbs.sh will convert the facts in an Acacia database into a file
called acacia-factbase.rsf that is 97% compliant with that generated by
cfx/fbgen.  

acacia2pbs.sh is the only file you need to call.  It in turn calls a set of
sh, ksh, awk and grok scripts to perform the conversion.

Although Acacia extracts facts from C++ code, I have assumed only C code
will be used.  I also assume you have already generated an Acacia database
for the system in question.


A Good Guinea Pig
=================

I suggest you take a copy of ctags-3.0.tar.gz that should be in this
directory and unpack it somewhere convenient.  Try to create make.acacia
and make.cfx as described below.  

Next, create the acacia database (entity.db, relationship.db etc) and run
some queries using the X windows program "ciao" as described below.

Finally, generate the factbase.rsf files using cfx and using my extractor
converter and compare the results as described below.

[Look in the "ctags-3.0" subdirectory of the directory where you found this
README for the correct make.acacia and make.cfx as well as the .rsf and .db
files.]


How to Generate an Acacia database
==================================

Go into the directory where the source is stored.  Make sure you have write
permission there.  Make sure you have access to Acacia.  On plg, there is a
copy of the latest version (as of Aug 1999) in ~migod/src.  Otherwise, you
can request permission to download a copy from AT&T research (go to
http://www.research.att.com/software/tools/Acacia/ and follow the
instructions).  Assuming you are using my version, set up these variables:
	setenv CIAOTYPE CC
	setenv CC gcc
	setenv CIAOROOT /u/migod/src/acacia
	setenv LD_LIBRARY_PATH $CIAOROOT/lib
	setenv PATH $CIAOROOT/bin:$CIAOROOT/etc/cia/bin:$PATH

Check to see if CCia or ciao are now on your path.

Go into a directory with a C system source.  Run configure or whatever else
you need to do to get it ready to compile.  Then do this:
	make -n -i > make.acacia
This should give you an idea of the compiler flags that make is sending to
gcc.  Save the -D and -I options and delete the rest (keep the -c option if
you want to process the source files individually).  Then replace "gcc"
with "CCia" and make the last line "CCia *.A"

You will get something like this in your make.acacia file:
	CCia -I. -DHAVE_CONFIG_H -c eiffel.c	# creates eiffel.A
	CCia -I. -DHAVE_CONFIG_H -c entry.c	# creates entry.A
	CCia -I. -DHAVE_CONFIG_H -c vstring.c	# creates vstring.A
	CCia *.A	# creates entity.[db,hix] relationship.[db,hix]

Alternatively, you could do it as one big bang:
	CCia -I. -DHAVE_CONFIG_H *.c
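
The per-file editing of the "make -n" output described above can be
roughly automated.  What follows is only a sketch in plain sh and awk, not
part of the converter scripts: the function name to_make_acacia is made up
for illustration, and it assumes that every compile line starts with "gcc",
that options are written without a space (-I. rather than -I .), and that
no argument contains blanks.  Check the result by hand before running it.

```shell
# Hypothetical helper (not part of the TAXFORM scripts): reads "make -n -i"
# output on stdin and writes CCia command lines on stdout.
to_make_acacia() {
    awk '
    $1 == "gcc" {
        out = "CCia"
        src = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^-[DI]/) out = out " " $i       # keep -D and -I options
            else if ($i ~ /\.c$/) src = src " " $i    # remember C source files
        }
        # Lines without a .c file (e.g. the link step) are dropped.
        if (src != "") { print out " -c" src; n++ }
    }
    END { if (n > 0) print "CCia *.A" }               # final combining step
    '
}

# Example with a made-up compile line and link line:
printf '%s\n' 'gcc -I. -DHAVE_CONFIG_H -O2 -c eiffel.c' \
              'gcc -o ctags eiffel.o' | to_make_acacia
```

In a source directory this would be used as
"make -n -i | to_make_acacia > make.acacia".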

If all is successful, you will end up with four new files: entity.db and
relationship.db (raw text output of the program entities and relations),
plus entity.hix and relationship.hix (index tables, I think).

While you might have to go and "fix up" the source a bit, the good news is
that CCia is much more forgiving than cfx.


Give CIAO a Spin
================

Once you have everything hunky dory, you might choose to use the X windows
tool ciao to play around with the system facts.  The tool is not well
documented, alas.  In a nutshell, there are two columns: one for entity1,
and one for entity2.  You can perform queries on entities (only the first
column information is used) or relationships between two entities.  You can
use the columns to constrain each of the entities (if you do a relationship
query on the empty table, you will be asking for all relationships between
all files, functions, variables, types etc. in the system; go for coffee if
so).  

For example, to find all uses of a function in main.c calling a macro
defined anywhere, set kind1=function, file1=main.c, kind2=macro and then
select query -> relationship.  You will get a nicely laid out drawing.  To
see the textual DB view, change "graph mode" to "db mode" on the top and
reselect query -> relationship.  You'll get the idea with enough playing
around.  If you select query -> entity, you'll get a list of all of the
functions defined in main.c.


Using the TAXFORM Converter
===========================

So now we have the acacia database.  To run my extractor converter, perform
the following setup:
	setenv TAXFORM /u/migod/taxform/bin
	setenv PATH ${TAXFORM}:${PATH}
Then run "acacia2pbs.sh" at the prompt level.  A bunch of sh, ksh, awk, and
grok scripts will run leaving you with a file called acacia-factbase.rsf
plus a directory called acacia2pbs that contains the intermediate files (in
case you want to look at them).  You can safely nuke this directory if you
like.  You should now have a factbase.rsf file compatible with what pbs
expects.


Generating an Equivalent factbase using cfx/fbgen
=================================================

Starting with the make.acacia file, above, replace CCia with cfx_cc for
each .c file, then combine them all together and run fbgen.  Using the
above example, do something like this:
        cfx_cc -I. -DHAVE_CONFIG_H -c eiffel.c    # creates .cfx.eiffel.o
        cfx_cc -I. -DHAVE_CONFIG_H -c entry.c     # creates .cfx.entry.o
        cfx_cc -I. -DHAVE_CONFIG_H -c vstring.c   # creates .cfx.vstring.o
        cfx_cc .cfx.*.o				  # creates cfx.mr
	fbgen -R`pwd` cfx.mr factbase.rsf.temp
	sort -u factbase.rsf.temp > factbase.rsf
	rm factbase.rsf.temp

Now open a W I D E xterm and try 
	"sdiff acacia-factbase.rsf factbase.rsf | less"
(Both files should be sorted first!)
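
If a side-by-side listing is too much to read, a quick count of how many
facts are unique to each file can help.  The following is a sketch only:
compare_factbases is a made-up name, it assumes one fact per line, and it
relies on mktemp being available on your system.  It sorts copies of both
files itself, so the inputs need not be sorted beforehand.

```shell
# Hypothetical helper (not part of the TAXFORM scripts).
# usage: compare_factbases FIRST.rsf SECOND.rsf
compare_factbases() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    # comm needs both inputs sorted the same way, so fix the locale.
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"
    echo "only in first: $(LC_ALL=C comm -23 "$tmp_a" "$tmp_b" | wc -l | tr -d ' ')"
    echo "only in second: $(LC_ALL=C comm -13 "$tmp_a" "$tmp_b" | wc -l | tr -d ' ')"
    echo "in both: $(LC_ALL=C comm -12 "$tmp_a" "$tmp_b" | wc -l | tr -d ' ')"
    rm -f "$tmp_a" "$tmp_b"
}

# Example call:
#   compare_factbases acacia-factbase.rsf factbase.rsf
```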


Caveats
=======

I have successfully run the source for tar, gmake, and ctags through the
converter with little trouble.  When I tried to compile vim, I got an
internal compiler error, which I bounced off to AT&T ("nasty, let me look
at this for a while", he said).

The big win is that this approach seems to extract more information with
less fuss than cfx.  The downside is that we can't redistribute acacia, tho
we can point people to the acacia download site.

Also, I am not 100% certain I am handling relationships involving
structs/enums/unions in the same way as cfx/fbgen.

CCia does quite a bit of work.  For example, if f1 calls f2, then there is
a "calls" relation recorded between the *definitions* of f1 and f2 (ie it
resolves the references to dependencies between the entity defns).  This
meant I had to undo some of the work since factbase.rsf assumes this
resolution hasn't yet been done.


Filename Convention
===================

Following the perceived PBS convention, relative filenames are left whole.
Absolute filenames are converted to <foo.h>.  If the name of a file begins
with "/usr/include", then I have left the rest of the filename intact
(e.g., <stdio.h>, <sys/time.h>).  If the name begins with a slash but does
not start with "/usr/include", then I lop off all but the filename and
surround in "<" and ">" (e.g.,
/.software/arch/egcs+p-1.1/...[deletia].../egcs-2.91.55/include/limits.h is
shortened to <limits.h>).  It is possible that this may create some false
positive relationships if there are multiple includes of files named limits.h
(in fact, this does happen in the system I used as a guinea pig as it also
includes <sys/limits.h>).
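
The three cases above can be sketched in plain sh as follows.  This is an
illustration of the convention, not the code used by the converter
scripts; the function name normalize_filename and the /opt/... path in the
example are made up.

```shell
# Hypothetical helper (not part of the TAXFORM scripts).
normalize_filename() {
    case "$1" in
        # Under /usr/include: keep the rest of the path, wrap in < >.
        /usr/include/*) printf '<%s>\n' "${1#/usr/include/}" ;;
        # Any other absolute path: keep only the last component.
        /*)             printf '<%s>\n' "${1##*/}" ;;
        # Relative names are left whole.
        *)              printf '%s\n' "$1" ;;
    esac
}

normalize_filename /usr/include/sys/time.h     # prints <sys/time.h>
normalize_filename /opt/gcc/include/limits.h   # prints <limits.h>
normalize_filename src/main.c                  # prints src/main.c
```

Note that the second case is where the collisions mentioned above come
from: two different absolute paths that end in the same filename map to
the same name.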


Disambiguation, Unique IDs, and Fake Polymorphism
=================================================

One point I claim relative victory on is disambiguation of references to
different but like-named entities.  As far as I am aware, PBS assumes names
are unique.  For example, suppose A.c defines static f and g, and this f
calls this g, *and* B.c also defines static f and g and this f calls this
g.  (This actually happens in the ctags source code, where the author uses
a kind of fake polymorphism style to process different languages similarly.)
Clearly A's f calls A's g and B's f calls B's g and never the twain shall
meet.  However, if you resolve references based only on entity names, then
you can be fooled into assuming A's f calls B's g and B's f calls A's g,
and thus there are file level dependencies from A to B and from B to A
(when in fact there aren't!).

CCia generates unique IDs for each entity and does all of its resolving
based on these IDs (an entity's name is merely an attribute).  In theory,
this should work ... except that it doesn't.  First, it seems to get
confused in the cases described above but only when the second entity is a
type.  Second, sometimes the same entity ends up with two unique
identifiers (which means you can miss relationships that do exist).  Ivan
thinks this might be due to not handling hashing collisions.  The old
extractor "cia" doesn't make these mistakes BTW, but its use is being
discouraged and it extracts less information than CCia.  

What I did is to create my own simple-minded mangling scheme by combining
the filename with a "#" and the entity name.  I use this as a uniqueID for
doing all of my relational calculating, then at the last minute I
regenerate the entity name by awk-ing off "<filename>#".  Thus, no
confusion ... well, not unless there is function overloading.  I could
solve this by adding more detail, like param lists (which are easy to get
at), but then I run into the problem of strings getting too long for grok
... unless I use a clever hashing scheme like acacia does (incorrectly).
Well, maybe I could solve world hunger too if I put my mind to it, but not
today.
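
To make the scheme concrete, here is a small sketch in sh and awk.  It is
not lifted from the converter scripts: the names mangle and unmangle are
made up, and it assumes filenames never contain a "#" themselves.

```shell
# Hypothetical helpers (not part of the TAXFORM scripts).

# usage: mangle FILE ENTITY     -> prints FILE#ENTITY
mangle() {
    printf '%s#%s\n' "$1" "$2"
}

# Reads mangled IDs on stdin, strips everything up to the first "#".
unmangle() {
    awk '{ sub(/^[^#]*#/, ""); print }'
}

mangle A.c f              # prints A.c#f
mangle B.c f              # prints B.c#f
mangle A.c f | unmangle   # prints f
```

The two static functions named f get different IDs while the relations
are being computed, and the plain name comes back only at the very end.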


Comparing CCia and cfx/fbgen output
===================================

Some notes after having examined output from each tool side-by-side:
-- PBS generates wrong line numbers for locs sometimes.  Acacia seems to
   get them correct always.
-- Both PBS and Acacia do an excellent job of extracting entities,
   including those defined in library include files.
-- Acacia gets a lot more (correct) calls and variable references that PBS
   does not (and should, as they are not tricky).
-- PBS gets some usetype, usestruct, and useenums that Acacia does not (but
   should).
-- there are some other small incompatibilities which don't matter too much
   (eg compiler-defined macros, the "*Initialization*" "file" that PBS
   produces).
-- I added an extra tuple called "libraryref func var" for references to
   variables declared in library include files (eg __ctype, errno).
   I could also add the following if there is interest by decommenting a
   couple of lines: dcllocend, vardcllocend, vardeflocend, macrodeflocend,
   typedeflocend.  Not sure if they are useful.

The main advantage to using the Acacia extractor (CCia) is that it gets a
LOT more information than cfx/fbgen does:  (correct) locations for
everything, signatures for functions, detailed dependency information at
the entity-to-entity level (not file-to-entity, tho you can get the file
information).

Of course the other advantage is that Acacia can also do extractions on C++
source code.  I have ignored C++ for now.

Extraction/conversion time isn't bad.  On the source for ctags-3.0 
(10 KLOC), cfx/fbgen took 12 seconds to create factbase.rsf on plg.  
It took acacia 9 seconds to create its database and then 30 seconds for my
scripts to convert it to factbase.rsf (filesize about 5000 lines for both).
Likely the conversion time can be decreased by being more clever about how
I call awk, etc.


Acacia vs Cia
=============

Acacia extracts facts from C++ code, and therefore also from C code,
roughly speaking.  Likely, you would want to use the Acacia extractor,
CCia, as it extracts a more detailed model of the source.  However, there
are a couple of bugs in CCia (it vomits over the vim source code) so you
might still want to use the Cia extractor ("cia") instead (it processes the
vim source just fine).  This will require a bit more work on my part as the
db formats are not quite the same.  I'll save it for a rainy afternoon.  

[Post note: It rained one afternoon, so I went ahead and implemented the
 cia extractor converter too.  Use the script "cia2pbs.sh" to do the
 extraction and follow the above guidelines.  The extractor is named "cia"
 instead of "CCia".  -- MWG Sept 1999]


Bugs, Nits, and Other Small Irritations
=======================================

None that I'm aware of.  The CCia parser breaks sometimes, but that's not
my fault.


--
Mike Godfrey 
migod@plg.uwaterloo.ca
August 1999
