#### Parsing C is a pain, but higher-order functions are great

I’ve recently spent quite a bit of time re-writing a C parser for work. C is a pretty simple language in many ways, but one of its decided obscurities is its syntax for declaring variables. In a language that does this straightforwardly (like Pascal or SML), you can write things like

x : int

to mean that the variable `x`

is of integer type. The syntax is simple: a variable name, followed by a colon, followed by a type.

Superficially, C looks as simple. To do the analogous declaration, you write

int x;

The rule looks simple: type first, and then variable name, with a final semi-colon. But this is not the rule at all! To say that `x`

is an array of 10 integers, you write

int x[10];

The syntax `[10]`

, for indicating that the type is an array, is associated with the variable name, not the base type. You might also wonder what

int x[10][5];

means. Is it five arrays of length ten, or the other way round? (It’s ten arrays of length five.)

Further, in many situations you can declare many variables at once, by separating them with commas. For example,

int x,y,z;

declares three variables, `x`

, `y`

and `z`

and says that they all have integer type. Now, if you write

int x[10], y;

the “array of 10 elements” modifier applies to just the `x`

, so that `y`

is just of type `int`

.

Then there are pointers. To say that variable `x`

is of type “pointer to integer”, you write

int *x;

So, given all that, you might like to have a guess as to what the following declares:

int *x[10];

Given all I’ve told you so far, you can’t know, because you have no way of telling if the array bit of the declaration is at a higher level than the pointer bit. In fact, it *is* the array bit, so that it means that `x`

is an array of ten pointers to integers. If you wanted a pointer to an array of ten integers, you’d have to write:

int (*x)[10];

What does this all have to do with higher-order functions? Well, the natural thing to do when parsing C is to generate values for each syntactic form. Here the interesting forms are the *declarators*, which include pointer, array and variable name information. A declarator is really a function: if you give it a type, then it builds that type into a bigger, more complicated thing. The empty declarator (which you get if you just have a bare variable name) is the identity function: the type you put in is the type you get out. (The types that we’re putting in come from the type that occurs to the left. In the examples above, this has always been `int`

, but it could be `char`

, `float`

or any other C type.)

The higher-order-ness arises when you think about what the square brackets used for arrays must indicate. The basic situtation is that you have some other declarator (*d*), followed by the square brackets containing an array size, ten say. The grammar production looks like:

declarator ← declarator`[`

integer`]`

So *d* is itself a function, and we have to calculate a new function to be the value of the whole thing. In other words, we have a higher-order function: one that takes as parameter a function. If *d* takes *τ* to *d(τ)*, then our new function has to take *τ* to *d(τ ^{10})*, where I have used super-scripting to indicate the array type. I.e.,

*τ*is an array of 10 elements of type

^{10}*τ*.

We can see this in an example. If the syntax is

int x[10][5];

then the innermost *d* function takes a *τ* to *τ ^{10}*. By appending the

`[5]`

syntax, we construct the type function that takes a *τ*to

*(τ*, ten repetitions of a five-element array, and not the other way ’round.

^{5})^{10}The case of the pointer syntax is similar. The syntax looks like

declarator ←`*`

declarator

and if you have a function *d* corresponding to the declarator following the `*`

, then the new function should take a type *τ* to *d(ptr(τ))*. Thus the syntax

*x[10]

which the precedence rules tell us should be viewed as `*(x[10])`

, first generates the function for the array, which takes *τ* to *τ ^{10}*. Then this is combined with the pointer notation, to give a function which takes

*τ*to

*(ptr(τ))*.

^{10}