<& headers.test , 'title' => 'c2eng, the How To' , sidebar_subtopic => 'perl:RecDescent' &>

This is the pre-edited version of my article in the Perl Journal.

I've been loosely following the litigation regarding the DeCSS code (more on that later) and I am interested in the issue of whether computer source code should be regarded as a form of protected expression. I side firmly in the affirmative, so I set out to write a program that takes ANSI-C code and translates it to gramatically correct descriptive English sentences. That means taking the C preprocessing statement: "#include " and turning it into "This program makes use of the system file math.h." Or turning "#define Pi 3.14159265358979" and turning it into "Note: we define Pi to mean '3.14159265358979'" The global declaration "int I,J,K;" becomes "Specifying the type integer, allocate the variables 'I', 'J', and 'K'."

And so on and so forth. Needless to say, the tool I chose for this task is Perl. I could have tried with yacc, but I knew more Perl than yacc.

The first thing I had to do was ask around Usenet for suggestions as to how to go about go about this task. Several folks pointed me in the direction of the Parse::RecDescent module. Having downloaded the module and podfile, I looked up more about yacc and got cozy with the C grammar at the appendix of Kernighan and Ritchie Second Edition. I also looked at the Parse::Yapp pod file. Unfortunately, I hopped around these sources inconsistently, because of the special constraints offered by the problem.

I concluded (and still stand by the conclusion) that Parse::RecDescent is te best tool for the task. My approach was to start with a loosely altered version of the K&R grammar, and run it against a sample of C code, modifying as I went along. That was a singularly bad idea. I'll get back to why in a little bit, but this is where we should come to an intro to RecDescent.

Parse::RecDescent offers several advantages over other parsers for this task, and here is why: a typical entry in a grammar looks like this:

<%text>
	Rule : Subrule1 Subrule2 
	Subrule1 : Subrule3 

But with Parse::RecDescent, you can modify the rule to do this:
<%text>
	Rule : Subrule1 Subrule2 
	{ $return = $item{subrule2} . $item{subrule1}; } 

This lets us assemble a return string ($return) in any Perly way we wish and return it to the superrule that called Rule. This was exactly the flexibility that was needed here. It lets us throw in various strings like "specified as the type" in the right places to turn a C statement into an English one.

Now, at the highest level of parsing in our case, we can run into a global variable declaration, a function prototype or definition, a comment, or a preprocessor directive. So our first rule looks like this:

<%text>		
startrule : (
	preproc {print $item{preproc};}
	| comment {print $item{comment};}
	| global_var_declaration {print $item{global_var_declaration};} 
	| function_definition {print $item{function_definition};}
	| function_prototype {print $item{function_prototype};}
)(s)

This tells RecDescent to collect the return string from whatever parses correctly, and then just print it out. We can only do the print statements at the top level because during the parsing we will parse many things successfully as part of unsucessful parsings. The parentheses around the subrules are here to say that we will find more than one "startrule" in our input text.

By this point I had posted several questions to comp.lang.perl.misc and gotten assistance from Damian Conway and Randal Schwartz, (by 'assistance' I mean being walked by the hand through a lot of issues). So I posted one more time and Damian sent me a C grammar that had been written for RecDescent. Since the grammar had been written without interruption by the things I was doing, it was far better for this purpose.

So if we weren't using recursive descent, we would have rules in our grammar looking like this:

<%text>
additive_expression : 
	multiplicative_expression 
	| additive_expression '+' multiplicative_expression 
	| additive_expression '-' multiplicative_expression 

multiplicative_expression : 
	cast_expression 
	| multiplicative_expression '*' cast_expression 
	| multiplicative_expression '/' cast_expression 
	| multiplicative_expression '%' cast_expression 

The rule for a multiplicative expression is the last in a long series covering the logical, bitwise, and other operators. But RecDescent will refuse to work with the rules as we have them here. So, let's concentrace on the additive expression. To make the rule logically equivalent but not left-recursive,we have the feature, which speeds up the parsing. This rule is equivalent to the previous, but works faster:
<%text>
additive_expression : 
	 
add_op : '-' | '+'

If life were simple, we'd stop here. But by translating the entire C grammar to this pattern we have a grammar that can be overwhelmed by the following statement from a legally-troubled piece of source code:
<%text>
	o_lfsr0 = (((((((lfsr0>>8)^lfsr0)>>1)^lfsr0)>>3)^lfsr0)>>7);

Using this set of parentheses lets the instruction fly fast at run time. But our rule for an expression in parentheses is several layers below a multiplicative expression. By adding to the rules a print statement, I got to see what was hapening:
<%text>
primary_expression : [other stuff] 
	| '(' expression 
# note we go all the way back to the top of the grammar here. 
	')' {print STDERR "$return\n";} 

The constant stream of text to standard error showed a beautiful pattern. For every twelve or so times our grammar parsed the innermost expression, it parsed the innermost two. For every twelve or so times it got the innermost two, it got the innermost three. Unfortunately, beauty is not our main goal here. Luckily, since we're not writing an actual C compiler, we have a little leeway. Think back to seventh grade. Your teacher is dictating a long equation and you write it on the board. While you're taking the dictation, you do not need to care about operator precedence. So, in this context, we don't either.

So after much cutting and pasting, the ten rules for operator expressions were condensed to two rules. Here's one rule:

<%text>
rel_mul_add_ex_op :
	'+' {$return = ' plus ';}
	| '-' {$return = ' minus '; }
	| '*' {$return = ' times '; }
	| '/' {$return = ' divided by '; }
	| '%' {$return = ' modulo '; }
	|'<<' {$return = ' shifted left by ';}
	| '>>' {$return = ' shifted right by ';}
	| '>='  {$return = ' checked to be greater than or equal to ';}
	| "<="  {$return = ' checked to be less than or equal to ';}
	| '>' {$return = ' checked to be greater than ';}
	| '<'  {$return = ' checked to be less than ';}
	
rel_add_mul_shift_expression : 
	

So now's the time for another digression. In 1557 a scholar named Robert Recorde writing a treatise on arithmetic wrote this famous quote:
To avoide the tediouse repetition of these woordes: is equalle to: I will settle as I doe often in woorke use, a paire of paralleles, or gemowe [twin] lines of one length e: =, bicause noe .2. thynges, can be moare equalle.
Yes, mathematical notation, our "universal language", began with the impatience of a man writing an entire book by hand, in a dank unlighted room, with an inkwell and a quill. Since our mathematical symbols are actually contractions for normal English woordes, we can take any equation and construct a gramatically correct sentence with it.

For example, "the square of the hypotenuse is equal to the square of the first leg plus the square of the second leg." This should explain to you the return values I'm using in the rule for operators. Of course, we can only turn an equation into a sentence if its variables stand for something. Alternatively, we can redefine "English" to include nouns, verbs, adjectives, adverbs, prepositions, conjunctions, interjections, and variable names. Compared to what happens in corporate press releases, this is a far milder assault on the purity of the language.

Having figured out how to descend recursively through a left-associative grammar, and translate algebraic equations back to the monotone of a seventh grade teacher, we have to put in the Perl code to construct it. Enter %item, and $return. At the end of every rule we have to put in an Action, in which we assign a value to $return. The simplest case of this is above. Now, for the expression rule:

<%text>
rel_add_mul_shift_expression : 
	
{
	$return = join('',@{$item[1]}); 
}

For the rule "foo : bar baz" we have the @item and %item variables with which to build $return. We can find out the $return value from the bar subrule by looking at $item{bar} or $item[1]. The baz rule gets $item[2] and $item{baz}. In the case of the rule above, we have only one subrule (the entire construct), and leftop is a special rule that returns a reference to an anonymous array containing the $return values for however many cast_expressions and operators get parsed. Since the hard work of building the English for the expression is done in the layers below, here we just dereference the array and join() it to a single string.

That turns this:

<%text>
  prod[2][1] = a[2][1]*b[1][1]+a[2][2]*b[2][1];

to this: Assign to array prod's element at address (2,1) the value "array a's element at address (2,1) times array b's element at address (1,1) plus array a's element at address (2,2) times array b's element at address (2,1)".

Now, for the more elaborate stuff.

<& footers.comp &>