Monday, 11 September 2006

Higher-order functions win the day!

Listening to:

Chopin, 24 Preludes, op. 28. Played by Jorge Bolet (Decca 448 985-2).

Parsing C is a pain, but higher-order functions are great

I’ve recently spent quite a bit of time re-writing a C parser for work. C is a pretty simple language in many ways, but one of its decided obscurities is its syntax for declaring variables. In a language that does this straightforwardly (like Pascal or SML), you can write things like

   x : int

to mean that the variable x is of integer type. The syntax is simple: a variable name, followed by a colon, followed by a type.

Superficially, C looks as simple. To do the analogous declaration, you write

   int x;

The rule looks simple: type first, and then variable name, with a final semi-colon. But this is not the rule at all! To say that x is an array of 10 integers, you write

   int x[10];

The syntax [10], for indicating that the type is an array, is associated with the variable name, not the base type. You might also wonder what

   int x[10][5];

means. Is it five arrays of length ten, or the other way round? (It’s ten arrays of length five.)

Further, in many situations you can declare many variables at once, by separating them with commas. For example,

   int x,y,z;

declares three variables, x, y and z and says that they all have integer type. Now, if you write

   int x[10], y;

the “array of 10 elements” modifier applies to just the x, so that y is just of type int.

Then there are pointers. To say that variable x is of type “pointer to integer”, you write

   int *x;

So, given all that, you might like to have a guess as to what the following declares:

   int *x[10];

Given all I’ve told you so far, you can’t know, because you have no way of telling if the array bit of the declaration is at a higher level than the pointer bit. In fact, it is the array bit, so that it means that x is an array of ten pointers to integers. If you wanted a pointer to an array of ten integers, you’d have to write:

   int (*x)[10];

What does this all have to do with higher-order functions? Well, the natural thing to do when parsing C is to generate values for each syntactic form. Here the interesting forms are the declarators, which include pointer, array and variable name information. A declarator is really a function: if you give it a type, then it builds that type into a bigger, more complicated thing. The empty declarator (which you get if you just have a bare variable name) is the identity function: the type you put in is the type you get out. (The types that we’re putting in come from the type that occurs to the left. In the examples above, this has always been int, but it could be char, float or any other C type.)

The higher-order-ness arises when you think about what the square brackets used for arrays must indicate. The basic situtation is that you have some other declarator (d), followed by the square brackets containing an array size, ten say. The grammar production looks like:

declarator ← declarator[integer]

So d is itself a function, and we have to calculate a new function to be the value of the whole thing. In other words, we have a higher-order function: one that takes as parameter a function. If d takes τ to d(τ), then our new function has to take τ to d(τ10), where I have used super-scripting to indicate the array type. I.e., τ10 is an array of 10 elements of type τ.

We can see this in an example. If the syntax is

   int x[10][5];

then the innermost d function takes a τ to τ10. By appending the [5] syntax, we construct the type function that takes a τ to 5)10, ten repetitions of a five-element array, and not the other way ’round.

The case of the pointer syntax is similar. The syntax looks like

declarator ← *declarator

and if you have a function d corresponding to the declarator following the *, then the new function should take a type τ to d(ptr(τ)). Thus the syntax


which the precedence rules tell us should be viewed as *(x[10]), first generates the function for the array, which takes τ to τ10. Then this is combined with the pointer notation, to give a function which takes τ to (ptr(τ))10.