<& headers.test , 'title' => 'c2eng, the How To' ,
	sidebar_subtopic => 'perl:RecDescent'
&>
<p>
<i>This is the pre-edited version of
my article in the <a href="http://www.tpj.com">
Perl Journal.</a>
</i>
</p>

<p>
I've been loosely following the litigation regarding the 
DeCSS code (more on that later) and I am 
interested in the issue of whether
computer source code should be regarded as a form of 
protected expression. I side firmly in the affirmative, 
so I set out to write a program that takes ANSI-C 
code and translates it to gramatically correct descriptive
English sentences. That means taking the C preprocessing
statement: "#include <math.h>" and turning it into 
"This program makes use of the system file math.h."
Or turning "#define Pi 3.14159265358979" and turning it
into "Note: we define Pi to mean '3.14159265358979'"
The global declaration "int I,J,K;" becomes "Specifying  
the type integer, allocate the variables 'I', 'J', and 'K'."
</p>

<p>

And so on and so forth. Needless to say, 
the tool I chose for this task is Perl.
I could have tried with yacc, but I 
knew more Perl than yacc.
</p>

<p>

The first thing I had to do was ask around Usenet 
for suggestions as to how to go about go about this task.
Several folks pointed me in the direction of the 
Parse::RecDescent module. Having downloaded the module
and podfile, I looked up more about yacc and 
got cozy with the C grammar at the appendix of
Kernighan and Ritchie Second Edition. I also looked
at the Parse::Yapp pod file.  Unfortunately,
I hopped around these sources inconsistently, because
of the special constraints offered by the problem.
</p>

<p>

I concluded (and still stand by the conclusion)
that Parse::RecDescent is te best tool for the task.
My approach was to start with a loosely altered
version of the K&R grammar, and run it against a sample of C code,
modifying as I went along. That was a singularly bad idea.
I'll get back to why in a little bit, but this is where
we should come to an intro to RecDescent. 
</p>

<p>

Parse::RecDescent offers several advantages over other 
parsers for this task, and here is why: a typical entry
in a grammar looks like this:
<pre>
<%text>
	Rule : Subrule1 Subrule2 
	Subrule1 : Subrule3 
</%text>
</pre>
But with Parse::RecDescent, you can modify the rule
to do this:
<pre>
<%text>
	Rule : Subrule1 Subrule2 
	{ $return = $item{subrule2} . $item{subrule1}; } 
</%text>
</pre>
This lets us assemble a return string ($return) in
any Perly way we wish and return it to the superrule
that called Rule. This was exactly the flexibility that was needed
here. It lets us throw in various strings like "specified as the type"
in the right places to turn a C statement into an English one.
</p>

<p>

Now, at the highest level of parsing in our case, we can run into
a global variable declaration, a function prototype or definition,
a comment, or a preprocessor directive. So our first rule looks 
like this:
<pre>
<%text>		
startrule : (
	preproc {print $item{preproc};}
	| comment {print $item{comment};}
	| global_var_declaration {print $item{global_var_declaration};} 
	| function_definition {print $item{function_definition};}
	| function_prototype {print $item{function_prototype};}
)(s)
</%text>
</pre>
This tells RecDescent to collect the return string from 
whatever parses correctly, and then just print it out.
We can only do the print statements at the top level 
because during the parsing we will parse many things successfully
as part of unsucessful parsings. The parentheses around the subrules
are here to say that we will find more than one "startrule" in our
input text.
</p>

<p>


By this point I had posted several questions to comp.lang.perl.misc
and gotten assistance from Damian Conway and Randal Schwartz, 
(by 'assistance' I mean being walked by the hand through a lot
of issues). So I posted one more time and Damian sent me a C grammar
that had been written for RecDescent. Since the grammar had been written
without interruption by the things I was doing, it was far better
for this purpose. 
</p>

<p>

So if we weren't using recursive descent, we would have rules in our
grammar looking like this: 
<pre>
<%text>
additive_expression : 
	multiplicative_expression 
	| additive_expression '+' multiplicative_expression 
	| additive_expression '-' multiplicative_expression 

multiplicative_expression : 
	cast_expression 
	| multiplicative_expression '*' cast_expression 
	| multiplicative_expression '/' cast_expression 
	| multiplicative_expression '%' cast_expression 
</%text>
</pre>
The rule for a multiplicative expression is the <em>last</em>
in a long series covering the logical, bitwise, and other 
operators. But RecDescent will refuse to work with the rules as 
we have them here. So, let's concentrace on the additive expression.
To make the rule logically equivalent but not left-recursive,we have 
the <leftop: > feature, 
which speeds up the parsing. This rule is 
equivalent to the previous, but works faster:
<pre>
<%text>
additive_expression : 
	<leftop: multiplicative_expression add_op multiplicative_expression> 
add_op : '-' | '+'
</%text>
</pre>
If life were simple, we'd stop here. But by translating the entire
C grammar to this pattern we have a grammar that can be overwhelmed 
by the following statement from a legally-troubled piece of 
source code: 
<pre>
<%text>
	o_lfsr0 = (((((((lfsr0>>8)^lfsr0)>>1)^lfsr0)>>3)^lfsr0)>>7);
</%text>
</pre>
Using this set of parentheses lets the instruction fly fast
at run time. But our rule for an expression in parentheses is 
several layers below a multiplicative expression. By adding to the rules
a print statement, I got to see what was hapening:
<pre>
<%text>
primary_expression : [other stuff] 
	| '(' expression 
# note we go all the way back to the top of the grammar here. 
	')' {print STDERR "$return\n";} 
</%text>
</pre>
The constant stream of text to standard error 
showed a beautiful pattern. For every twelve or so times 
our grammar parsed the innermost expression, it parsed the 
innermost two. For every twelve or so times it got the
innermost two, it got the innermost three. Unfortunately,
beauty is not our main goal here. Luckily, since we're
not writing an actual C compiler, we have a little leeway.
Think back to seventh grade. Your teacher is dictating a long
equation and you write it on the board. While you're taking the
dictation, you do not need to care about operator precedence.
So, in this context, we don't either. 
</p>

<p>

So after much cutting and pasting, the ten rules for 
operator expressions were condensed to two rules. Here's
one rule:
<pre>
<%text>
rel_mul_add_ex_op :
	'+' {$return = ' plus ';}
	| '-' {$return = ' minus '; }
	| '*' {$return = ' times '; }
	| '/' {$return = ' divided by '; }
	| '%' {$return = ' modulo '; }
	|'<<' {$return = ' shifted left by ';}
	| '>>' {$return = ' shifted right by ';}
	| '>='  {$return = ' checked to be greater than or equal to ';}
	| "<="  {$return = ' checked to be less than or equal to ';}
	| '>' {$return = ' checked to be greater than ';}
	| '<'  {$return = ' checked to be less than ';}
	
rel_add_mul_shift_expression : 
	<leftop: cast_expression rel_mul_add_ex_op cast_expression>
</%text>
</pre>
So now's the time for another digression. In 1557
a scholar named Robert Recorde writing a treatise 
on arithmetic wrote this famous quote: 
<blockquote>
To avoide the tediouse repetition of these woordes: is equalle to: 
I will settle as I doe often in woorke use, a paire of paralleles, 
or gemowe [twin] lines of one length e: =, bicause noe .2. 
thynges, can be moare equalle. 
</blockquote>
Yes, mathematical notation, our "universal language",
began with the impatience of a man writing an entire book
by hand, in a dank unlighted room, with an inkwell 
and a quill. Since our mathematical symbols are actually 
contractions for normal English woordes, we can take any
equation and construct a gramatically correct sentence with it.
</p>

<p>

For example, "the square of the hypotenuse is equal to 
the square of the first leg plus the square of the 
second leg." This should explain to you the return values I'm
using in the rule for operators. Of course, 
we can only turn an equation into a sentence if 
its variables stand for something. Alternatively, we can redefine
"English" to include nouns, verbs, adjectives, adverbs, 
prepositions, conjunctions, interjections, and variable names.
Compared to what happens in corporate press releases, this is
a far milder assault on the purity of the language.
</p>

<p>

Having figured out how to descend recursively through 
a left-associative grammar, and translate algebraic 
equations back to the monotone of a seventh grade teacher, 
we have to put in the Perl code to construct it. 
Enter %item, and $return. At the end of every rule
we have to put in an Action, in which we assign a value to 
$return. The simplest case of this is above. Now, for
the expression rule:  
<pre>
<%text>
rel_add_mul_shift_expression : 
	<leftop: cast_expression rel_mul_add_ex_op cast_expression>
{
	$return = join('',@{$item[1]}); 
}
</%text>
</pre>
For the rule "foo : bar baz" we have the @item and %item 
variables with which to build $return. We can find out the 
$return value from the bar subrule by looking at 
$item{bar} or $item[1]. The baz rule gets $item[2]
and $item{baz}. In the case of the rule above, we
have only one subrule (the entire <leftop:...> construct),
and leftop is a special rule that returns a reference
to an anonymous array containing the $return values
for however many cast_expressions and operators get parsed.
Since the hard work of building the English for the 
expression is done in the layers below, here we just 
dereference the array and join() it to a single string.
</p>

<p>

That turns this: 
<pre>
<%text>
  prod[2][1] = a[2][1]*b[1][1]+a[2][2]*b[2][1];
</%text>
</pre>
to this: 
<tt>
Assign to array prod's element at address (2,1) the value "array a's element at address (2,1) times array b's element at address (1,1) plus array a's element at address (2,2) times array b's element at address (2,1)".
</tt>
</p>
<p>
<a href="c2eng_a.html">Now, for the more elaborate stuff.</a>
</p>
<& footers.comp &>