Cforall Design Issues

New Language Features

Declaration Syntax

The C declaration syntax is baroque and complex making the expression of even simple declarations extremely error-prone. There are two reasons for this. First, a single declaration can create several variables with different type, e.g.:
int i, *j, m[10];
creates three variables each working with integer values but each having a different type. This syntax causes declaration errors like:
int * i, j;
which means
int *i; int j;
not
int *i; int *j;
Thus the distribution of the declaration type over the variable list is complex. Nevertheless, this declaration syntax is extremely concise because, in practise, it is very common to have a group of related variables based on a basic type; these variables can all be declared in a single declaration using this syntax. As well, declaration errors resulting from incorrect distribution of type information occur infrequent and are detected by the type system. Finally, this declaration syntax is largely entrenched in the C style of programming. The second declaration problem is much more serious and results from the syntactic nesting used for specifying complex type information, e.g.:
char (*argv[])[];	// often confusingly written as: char **argv
void (*(*p)[10])(int);
The type is formed by nesting the most specific type information around the variable being declared and less specific type information is nested in layers around the more specific type information, using parenthesis to precisely control the nesting. Unfortunately, this syntax confuses even veteran C programmers and results in many errors. Furthermore, it precludes other desirable extensions like named or multiple return values from routines. There is no subtle fix for this problem because the basic idea is inappropriate for specifying declarations, which means that any fix cannot be upwards compatible with C. Other C-like programming languages do not have these declarations problems so clearly the problems can be fixed. The question is how to engineer a fix that has a reasonable chance of acceptance by the C community. First, the C declaration order type-information variable-list is retained for both technical and subjective reasons. The technical reason is that the Pascal/Ada style of variable-list type-information precludes variable initialization for simple one pass compilers, e.g., compare:
int i = 3, j = 4, k = 5;
to:
i : integer = 3;
j : integer = 4;
k : integer = 5;
The brevity of the former declaration follows the C philosophy. Both of the following alternative syntactic forms have problems:
i = 3, j = 4, k = 5 : integer;
i, j, k : integer = 3, 4, 5;
In the first, the compiler does not know the variable type until late in the declaration, which makes checking the validity of the initialized values difficult. In the second, the separation of the initialization values from the variables increases the potential for incorrect initialization. Finally, Pascal/Ada syntactic form deviates completely from current C syntax, making it problematic for C programmers. To handle the nested type problems the following new type, variable and routine syntax are introduced. The new style of declarations may appear together with existing C declarations in the same program block, but cannot be mixed within a specific declaration.

Declarations

In the new declaration syntax, the character * is used to indicate a pointer, square brackets [] are used to represent an array, and parentheses () are used to indicate a routine declaration, which is identical to C declarations. However, unlike C, the type declaration tokens are specified from left to right. For instance, a variable x of type pointer to integer is defined as follows:
NEW		OLD (C)

* int x;	int *x;
Other examples are:
[20] int y;		// array of 20 integers
* [20] double z;	// pointer to array of 20 doubles
[20] * char w;		// array of 20 pointers to char

Routine prototype declarations are specified with the return type enclosed in brackets, e.g.:

[ int ] ( int, * int, [] * char ) f;
[ [] float f ] ( int i, double d ) g;
The parameter types appear in the parenthesis (optional parameter names are allowed), and the return type appears after the parenthesis in square brackets (an optional return variable name is allowed). The parenthesis and brackets must always be specified simply to encourage readability, even when there are no parameters or value returned, as in:
[] () f;
[] (int) f;
[int] () f;
Two differences from C are that an empty parameter list implies no parameters not an unknown parameter list (as in C++), and no return type implies no return value not an integer return value. The C keywords extern and static, which declare externally defined and locally defined routines respectively, are supported:
extern [int] () f;
static [int] (int) g;
These keywords are also supported for variable declarations:
extern int v1;
static * int v2;
The new syntax can be used for declaring fields in a structure or union, e.g.:
NEW				OLD (C)

struct s1 {			struct s2 {
    * int f1;			    int *f1;
    * [* int] (int) f2;		    int *(*f2)(int);
};				};

union u1 {			union u2 {
    [5] int f1;			    int f1[5];
    * int f2;			    int *f2;
};				};

Routine Definition

A new syntax is provided for routine definition (but existing C routine definition is also possible), e.g.:
static [ int x; ] foo( int y; char z1, z2; ) {
    // routine body
}
First, the optional keyword static is equivalent to that in standard C. Second, return values are named and the name is a local variable of the particular return type. The value of the local return variable is automatically returned at routine termination. Third, lists of declarations are separated by a semicolon instead of a comma, which allows parameter declarations to appear more like declarations and one type can declare several parameters. An external declaration of the routine foo is written as follows:
extern [ int ] ( int, char, char ) foo;
Note that the name of the return value is unnecessary.

Returning Values

For new style routines, a value in the return variable is automatically returned when the routine terminates. Therefore, the return statement does not contain an expression, for example:
[int x;] bar() {
    ...
    x = 0;
    return;
}
When the return is encountered, the current value of x is returned to the caller. `Falling off the bottom' of a routine is permitted, as in:
[int x;] foobar() {
    x = 0;
}
In this case, the current value of x, i.e. 0, is returned to the calling routine just as if a return had been encountered.

Garbage Collection

Storage Management

The misuse of heap storage is a significant problem, especially in large, complex, long running programs like operating systems or databases. The use of explicit malloc/free exacerbates the problem but fundamentally storage management is difficult and problematic so attacking a single aspect does not solve the general problem. Given the fundamental complexity of the problem, the best approach is to provide runtime tools to help understand the complex behaviour of the storage management, and in addition, introduce new programming language constructs that reduce heap usage.

Label

Change the scope of label from routine scope to block scope, e.g.:
goto x;	// disallowed with new rule because x is no longer visible
{
    goto x;
  x: 3;
    {
        goto x;
    }
    goto x;
}
goto x;	// disallowed with new rule because x is no longer visible

Advantage: This restriction prevents branching into blocks, which ensures that all variables in a block are well formed during the block's execution, except for intra-block problems, such as:

  goto x;		// branch around initialization
  foo f( 3 );
x:
  f.mem();		// try to call member routine of uninitialized object

Disadvantage: Not backwards compatible, but only affects a form of branching that should seldom appear because of `structured programming'.

Break and Continue Statements

Extend break and continue with a target label to support static multi-level exit [Buhr85], e.g.:
   NEW					   OLD
x:for ( ... ) {				for ( ... ) {
    y:for ( ... ) {			    for ( ... ) {
        z:for ( ... ) {			        for ( ... ) {
            ... break x; ...		            ... goto x; ...
            ... break y; ...		            ... goto y; ...
            ... break z; ...		            ... break; ...
	}				        }
    }					    } y:;
}					} z:;
The inner most loop has three exit points, which cause termination of the three nested loops, respectively.
   NEW					   OLD
x:for ( ... ) {				x:for ( ... ) {
    y:for ( ... ) {			    y:for ( ... ) {
        z:for ( ... ) {			        for ( ... ) {
            ... continue x; ...		            ... goto x; ...
            ... continue y; ...		            ... goto y; ...
            ... continue z; ...		            ... continue; ...
	}				        }
    }					    }
}					}
The inner most loop has three restart points, which cause the next loop iteration to begin, respectively. A break or continue without a target label has the same meaning as the current control structure, i.e., apply to the current loop construct. The target label must be directly associated with a for, while, do or switch statement. When blocks are exited by a break or continue, all variables within the blocks are terminated. break and continue are simply a goto restricted in the following ways:

Advantage: This extension allows static multi-level exits without having to use the goto statement and ties the control flow to the target control structure rather than an arbitrary point in a program. Furthermore, the location of the label at the beginning of the target control structure informs the reader of the program that a multi-level exit is occurring in the body of the control structure. Having the label at the end of the control structure, which might be off the page or screen, is not as convenient.

Type System

Array Types

Make arrays first class types instead of pointers. Actual array declaration:
[10,10,10] int a;		// NEW
int [10][10][10] a;		// OLD
Formal array definition:
void f( [,,] int a);		// NEW
void f( [10,,10] int a );	// NEW
void f( int [10][][10] a );	// OLD
Subscript:
a[1,2,3] = 3;			// NEW
a[1][2][3] = 3;			// OLD
New arrays would have array descriptors, which are passed around to routines. In many cases, it should be possible to pass a new array to a routine expecting an old array, but not vice versa.

Disadvantage: The major incompatibility is that int * no longer matches with int [], for example:

int main( int argc, char **argv ) ...
must be written as:
int main( int argc, [] * [] char argv ) ...

A minor incompatibility is that a comma expression cannot appear directly in a subscript, so that a[1,2,3], which currently means a[3], must be written as a[(1,2,3)]. It is questionable if this a useful feature and it is doubtful many programs use it.

Dynamically Sized Arrays (DSA)

Extend array declarations with dynamic dimensions. Dynamically sized array can occur in blocks and classed, for example:
int size;
cin >> size;
int array[size + 1];	// already supported in g++

class foo {
    const int size;	// const not required
    int array[size + 1];
  public:
    foo( int size ) : size(size) {}	// initialization required
};
Multiple DSAs are allowed in a block or class. In the latter example, size must be initialized when the class is instantiated. The qualifier const implies this, but it does not seem essential to force the use of this qualifier as only the initialization is required; without the const qualifier, size could be used as a variable in the class members.

Advantage: DSAs make use of implicit storage management on the stack for dynamically sized objects rather than explicit storage management on the heap, which reduces the potential for storage management errors. This extension is backwards compatible.

Multi-Valued Contexts

returning multiple values, multiple assignment, tuples

Class

Concurrency

Exceptions

introduce exceptions, throw exceptions not objects and have bound exceptions

Modules

A module provides three important capabilities. First, it provides a mechanism to order initialization of library software. This feature is essential because it allows a program that is composed of multiple modules (and these modules may be composed of other modules) to be properly initialized before the program begins execution. Currently, C++ does not define the relative order of initialization of objects declared with static storage duration in different translation units. As a result, there is no way to ensure that a library is initialized before the objects that depend on it are instantiated (e.g., the runtime system of a garbage collector). This leaves a library implementor with three options in C++:
  1. Forbid the declaration of library objects with static storage duration.
  2. Test repeatedly in the library to ensure that initialization is performed.
  3. Declare an instance of a library initialization object before any declarations that depend on the library in each translation unit, but ensure that only one of the initialization instances actually performs the initialization of the library. The declaration of the initialization object can be made implicit by putting it in the include file for the library, which must be included in each translation unit before library features are used.
The first solution is very restrictive, the second is inefficient, and the third is a C++ idiom that is unobvious to a library implementor.

Second, modules provide a mechanism to collect interface information into units smaller than a UNIX file. That is to say, multiple module interfaces can appear in a single UNIX file.

Third, modules provide the ability to compile an interface so that is it unnecessary to reparse the interface as for textual substitution. This makes compiling programs that contain even a large number of interfaces efficient. This is similar to the same way that separate compilation improves the efficiency of the development of a large program.

Priority of * and .

The raison d'etre for the -> operator in C is that the priority of operators * and . are incorrect. The priority of operator . is lower than the priority of operator *, which means *p.f executes as *(p.f) instead of (*p).f. This behaviour is counter-intuitive to all but hard-core C programmers and is different from all modern imperative programming languages. Instead of switching the priority of operators * and ., the operator -> was introduced to mean the two operators but performed with the correct priority, that is, p->f implies (*p).f.

Advantage: This change corrects a long standing design error in C and would pave the way for the eventual removal of the unnecessary operator -> from the language.

Disadvantage: This change is not backwards compatible. However, it remains to be seen how many programs use the form *p.f to mean *(p.f).

Does this help with `smart' pointer?

Reference Type

A reference is a pointer that is automatically de-referenced, i.e., it is unnecessary to explicitly de-reference a reference variable to refer to the value it is pointing at.
int i;
int &r = i;	=>	int *const r = &i;
r = 2;		=>	*r = 2;
Not particularly useful in this context except to give multiple names to the same storage. The problem is that references in C++ are severely restricted. Only one level of automatic de-reference is done, and a reference variable must be initialized on declaration. Useful for call by reference so that it is unnecessary to always have to prefix the parameter with * and can use dot notion instead of -> notion.
string substr(string &s);		string substr(string *s);
   s = ... s ...		vs	    *s = ... *s ...

Assignment

No Routine Parameters

Currently, C requires a routine with no parameters to have parameter type void. Changes this to the C++ ...

Definition Before Use

C requires definition before use, which allows the compiler to make a single scan of a program, which is more efficient. However, it requires programmers to fuss with the source text in a non-productive way. In this instance, people time is more expensive than computer time. For example, there may be conflicting source code placement issues, such as placement for inlining and placement for reference to types. Let the compiler deal with this problem, when possible.