<& headers.test , 'title' => 'c2eng, the How To' ,
	sidebar_subtopic => 'perl:RecDescent'
&>
<p>
<i>This is the pre-edited version of
my article in the <a href="http://www.tpj.com">
Perl Journal.</a>
</i>
</p>

<p>
I've been loosely following the litigation regarding the 
DeCSS code (more on that later) and I am 
interested in the issue of whether
computer source code should be regarded as a form of 
protected expression. I side firmly in the affirmative, 
so I set out to write a program that takes ANSI-C 
code and translates it to gramatically correct descriptive
English sentences. That means taking the C preprocessing
statement: "#include <math.h>" and turning it into 
"This program makes use of the system file math.h."
Or turning "#define Pi 3.14159265358979" and turning it
into "Note: we define Pi to mean '3.14159265358979'"
The global declaration "int I,J,K;" becomes "Specifying  
the type integer, allocate the variables 'I', 'J', and 'K'."
</p>

<p>

And so on and so forth. Needless to say, 
the tool I chose for this task is Perl.
I could have tried with yacc, but I 
knew more Perl than yacc.
</p>

<p>

The first thing I had to do was ask around Usenet 
for suggestions as to how to go about go about this task.
Several folks pointed me in the direction of the 
Parse::RecDescent module. Having downloaded the module
and podfile, I looked up more about yacc and 
got cozy with the C grammar at the appendix of
Kernighan and Ritchie Second Edition. I also looked
at the Parse::Yapp pod file.  Unfortunately,
I hopped around these sources inconsistently, because
of the special constraints offered by the problem.
</p>

<p>

I concluded (and still stand by the conclusion)
that Parse::RecDescent is te best tool for the task.
My approach was to start with a loosely altered
version of the K&R grammar, and run it against a sample of C code,
modifying as I went along. That was a singularly bad idea.
I'll get back to why in a little bit, but this is where
we should come to an intro to RecDescent. 
</p>

<p>

Parse::RecDescent offers several advantages over other 
parsers for this task, and here is why: a typical entry
in a grammar looks like this:
<pre>
	Rule : Subrule1 Subrule2 
	Subrule1 : Subrule3 
</pre>
But with Parse::RecDescent, you can modify the rule
to do this:
<pre>
	Rule : Subrule1 Subrule2 
	{ $return = $item{subrule2} . $item{subrule1}; } 
</pre>
This lets us assemble a return string ($return) in
any Perly way we wish and return it to the superrule
that called Rule. This was exactly the flexibility that was needed
here. It lets us throw in various strings like "specified as the type"
in the right places to turn a C statement into an English one.
</p>

<p>

Now, at the highest level of parsing in our case, we can run into
a global variable declaration, a function prototype or definition,
a comment, or a preprocessor directive. So our first rule looks 
like this:
<pre>		
startrule : (
	preproc {print $item{preproc};}
	| comment {print $item{comment};}
	| global_var_declaration {print $item{global_var_declaration};} 
	| function_definition {print $item{function_definition};}
	| function_prototype {print $item{function_prototype};}
)(s)
</pre>
This tells RecDescent to collect the return string from 
whatever parses correctly, and then just print it out.
We can only do the print statements at the top level 
because during the parsing we will parse many things successfully
as part of unsucessful parsings. The parentheses around the subrules
are here to say that we will find more than one "startrule" in our
input text.
</p>

<p>


By this point I had posted several questions to comp.lang.perl.misc
and gotten assistance from Damian Conway and Randal Schwartz, 
(by 'assistance' I mean being walked by the hand through a lot
of issues). So I posted one more time and Damian sent me a C grammar
that had been written for RecDescent. Since the grammar had been written
without interruption by the things I was doing, it was far better
for this purpose. 
</p>

<p>

So if we weren't using recursive descent, we would have rules in our
grammar looking like this: 
<pre>
additive_expression : 
	multiplicative_expression 
	| additive_expression '+' multiplicative_expression 
	| additive_expression '-' multiplicative_expression 

multiplicative_expression : 
	cast_expression 
	| multiplicative_expression '*' cast_expression 
	| multiplicative_expression '/' cast_expression 
	| multiplicative_expression '%' cast_expression 
</pre>
The rule for a multiplicative expression is the <em>last</em>
in a long series covering the logical, bitwise, and other 
operators. But RecDescent will refuse to work with the rules as 
we have them here. So, let's concentrace on the additive expression.
To make the rule logically equivalent but not left-recursive,we have 
the <leftop: > feature, 
which speeds up the parsing. This rule is 
equivalent to the previous, but works faster:
<pre>
additive_expression : 
	<leftop: multiplicative_expression add_op multiplicative_expression> 
add_op : '-' | '+'
</pre>
If life were simple, we'd stop here. But by translating the entire
C grammar to this pattern we have a grammar that can be overwhelmed 
by the following statement from a legally-troubled piece of 
source code: 
<pre>
	o_lfsr0 = (((((((lfsr0>>8)^lfsr0)>>1)^lfsr0)>>3)^lfsr0)>>7);
</pre>
Using this set of parentheses lets the instruction fly fast
at run time. But our rule for an expression in parentheses is 
several layers below a multiplicative expression. By adding to the rules
a print statement, I got to see what was hapening:
<pre>
primary_expression : [other stuff] 
	| '(' expression 
# note we go all the way back to the top of the grammar here. 
	')' {print STDERR "$return\n";} 
</pre>
The constant stream of text to standard error 
showed a beautiful pattern. For every twelve or so times 
our grammar parsed the innermost expression, it parsed the 
innermost two. For every twelve or so times it got the
innermost two, it got the innermost three. Unfortunately,
beauty is not our main goal here. Luckily, since we're
not writing an actual C compiler, we have a little leeway.
Think back to seventh grade. Your teacher is dictating a long
equation and you write it on the board. While you're taking the
dictation, you do not need to care about operator precedence.
So, in this context, we don't either. 
</p>

<p>

So after much cutting and pasting, the ten rules for 
operator expressions were condensed to two rules. Here's
one rule:
<pre>
rel_mul_add_ex_op :
	'+' {$return = ' plus ';}
	| '-' {$return = ' minus '; }
	| '*' {$return = ' times '; }
	| '/' {$return = ' divided by '; }
	| '%' {$return = ' modulo '; }
	|'<<' {$return = ' shifted left by ';}
	| '>>' {$return = ' shifted right by ';}
	| '>='  {$return = ' checked to be greater than or equal to ';}
	| "<="  {$return = ' checked to be less than or equal to ';}
	| '>' {$return = ' checked to be greater than ';}
	| '<'  {$return = ' checked to be less than ';}
	
rel_add_mul_shift_expression : 
	<leftop: cast_expression rel_mul_add_ex_op cast_expression>
</pre>
So now's the time for another digression. In 1557
a scholar named Robert Recorde writing a treatise 
on arithmetic wrote this famous quote: 
<blockquote>
To avoide the tediouse repetition of these woordes: is equalle to: 
I will settle as I doe often in woorke use, a paire of paralleles, 
or gemowe [twin] lines of one length e: =, bicause noe .2. 
thynges, can be moare equalle. 
</blockquote>
Yes, mathematical notation, our "universal language",
began with the impatience of a man writing an entire book
by hand, in a dank unlighted room, with an inkwell 
and a quill. Since our mathematical symbols are actually 
contractions for normal English woordes, we can take any
equation and construct a gramatically correct sentence with it.
</p>

<p>

For example, "the square of the hypotenuse is equal to 
the square of the first leg plus the square of the 
second leg." This should explain to you the return values I'm
using in the rule for operators. Of course, 
we can only turn an equation into a sentence if 
its variables stand for something. Alternatively, we can redefine
"English" to include nouns, verbs, adjectives, adverbs, 
prepositions, conjunctions, interjections, and variable names.
Compared to what happens in corporate press releases, this is
a far milder assault on the purity of the language.
</p>

<p>

Having figured out how to descend recursively through 
a left-associative grammar, and translate algebraic 
equations back to the monotone of a seventh grade teacher, 
we have to put in the Perl code to construct it. 
Enter %item, and $return. At the end of every rule
we have to put in an Action, in which we assign a value to 
$return. The simplest case of this is above. Now, for
the expression rule:  
<pre>
rel_add_mul_shift_expression : 
	<leftop: cast_expression rel_mul_add_ex_op cast_expression>
{
	$return = join('',@{$item[1]}); 
}
</pre>
For the rule "foo : bar baz" we have the @item and %item 
variables with which to build $return. We can find out the 
$return value from the bar subrule by looking at 
$item{bar} or $item[1]. The baz rule gets $item[2]
and $item{baz}. In the case of the rule above, we
have only one subrule (the entire <leftop:...> construct),
and leftop is a special rule that returns a reference
to an anonymous array containing the $return values
for however many cast_expressions and operators get parsed.
Since the hard work of building the English for the 
expression is done in the layers below, here we just 
dereference the array and join() it to a single string.
</p>

<p>

That turns this: 
<pre>
  prod[2][1] = a[2][1]*b[1][1]+a[2][2]*b[2][1];
</pre>
to this: 
<tt>
Assign to array prod's element at address (2,1) the value "array a's element at address (2,1) times array b's element at address (1,1) plus array a's element at address (2,2) times array b's element at address (2,1)".
</tt>
</p>

<p>

For something more interesting on the English side of things,
let's look at our variable declarations (trimming away much detail).
<pre>
declaration :
    declaration_specifiers init_declarator_list(?) ';'
</pre>
'declaration_specifiers' is where we find out the datatype.
'init_declarator' list is where we list the variables we're
declaring, and figure out initializations, and whether some
of these are pointers, et cetera.
<pre>
init_declarator_list : <leftop:  init_declarator ','  init_declarator> 
{
    my @init_decl_list = @{$item[1]}; 
# Once again, we dereference to get the array the <leftop:> 
# construct gives us. There could be 1, 2, or more
# declarators. Each one we have to handle differently. 
    my $last = ''; 

    if ($#init_decl_list > 1) {
# To avoid calling up the ghost of E.B. White, 
# in the plural case we have to separate the declarators
# by commas (funny how English and C share this requirement), and
# add the word "and" before the last one:
        $last = pop @init_decl_list; 
        $return = 'the variables ' . join(', ', @init_decl_list) . ', and ' . $last;
    } elsif ( $#init_decl_list == 1 ) { 
# In the dual case, we don't have to use commas:
        $return = 'the variables ' .  $init_decl_list[0] . ' and ' .$init_decl_list[1];
    } else { 
# And in the singular case things are easy:
        $return = 'the variable ' . $init_decl_list[0]; 
    }
}
</pre>
Now, for the declaration return value, we can say
<pre>
{
	$return = "Specifying the type $item{declaration_specifiers}, allocate "; 
	$return .= join('',@{$item{init_declaration_list}}) . ".\n"; 
}
</pre>
Now we're ready to turn this : 
<pre>
        unsigned int lfsr1_lo,lfsr1_hi,lfsr0,combined;
        byte o_lfsr0, o_lfsr1;
</pre>
to this : 
<tt>
Specifying the type unsigned integer, allocate the variables 'lfsr1_lo', 'lfsr1_hi', 'lfsr0', and 'combined'.
Specifying the type byte, allocate the variables 'o_lfsr0' and 'o_lfsr1'.
</tt>
</p>

<p>

There are a few more tricks we need to use for this task.
For example, we could parse an assignment expression 
succesfully as a complete statement or as part of one.
This is where the %arg hash rescues us. We write the 
rule for an assignment operator this way: 
<pre>
assignment_operator : 
    '=' 
{
    if ($arg{context} eq 'statement') { 
        $return = ['Assign to', ' the value' ] ; 
    } else { 
        $return = ', which is assigned to be '; 
    }
}
</pre>
$arg{context} is a scalar that the assignment_operator is
given from its parent rule. The parent rule assigns it this way:
<pre>
assignment_expression : 
    unary_expression[context => $arg{context}] 
    assignment_operator[context => $arg{context}] 
    assignment_expression[context =>  'assignment_expression'] 
</pre>
In the bracket sets is the definition of the %arg hashes
each subrule is given. Note how in the first two sets 
we are just letting the subrules inherit the same $arg{context},
while the last case gets a new one. Now to figure out the $return
for the assignment expression:
<pre>
    if ($arg{context} eq 'statement' ) { 
        $return .= "${$item{assignment_operator}}[0] $item{unary_expression}${$item{assignment_operator}}
[1] \"$item{assignment_expression}\".\n";
    } else {
        $return = "$item{unary_expression}, $item{assignment_operator} $item{assignment_expression}"; 
    } 
</pre>
So this way the C statement "foo = bar = baz;" gets translated as
"Assign to 'foo' the value 'bar' which is assigned to be 'baz'."
</p>

<p>

Another issue that comes up is when we need to know what comes 
after a rule we've parsed.  For example, when we use '-' as a
unary operator (e.g. "foo = -bar; baz = -1;"), we need to know what
comes afterwards because standard usage in seventh grade is 
"minus b" but "negative one". One way to solve this is to do this
in the rule for a unary operator:
<pre>
unary_operator : [ other things] 
	| '-' ...constant {$return  = 'negative ';}
	| '-' {$return = 'minus ';}
</pre>
What happens is that while the parsing is going on, RecDescent is 
keeping track of a variable called $text, which contains the parts
of the remaining text that have not yet been parsed. By preceding
'constant' with an ellipsis, we get RecDescent to look for a constant
expression but not remove the constant from $text. We could also
go into our block of Perl commands and do "if ($text =~ ..."
with an appropriate regular expression. This way "-1" gets translated
to "negative one" while "-x" gets translated to "minus x".
</p>

<p>

Having figured out that the correct way to verbalize 
mathematical expressions is to spell them out in 
the drone of a math teacher, and that this lets us
make our script work faster since we can cut down the number
of rules in that series from ten to two, we've taken care of 
the hardest issues. Compared to that, it's easy to verbalize
the flow control statements in C, and the function definitions.
The simpler expressions, are, well, simpler. It's easy to 
translate "foo(bar);" to "Perform the function 'foo' 
as applied to the argument 'bar'." In fact, to do 
so requires no more tricks than what we've done in 
the examples above. But there is another issue we have to 
cover, which the is parsing of parentheses in an expression.
</p>

<p>

In mathematics, parentheses allow us to bypass the 
rules of operator precedence. Unfortunately, parems
don't translate well to human discourse, because what 
they conceptually do is set up a context stack. 
Our minds do implement a stack, which lets our 
attention change contexts as we deal with various aspects
of our lives. But that stack is limited to 6 
frames of context (for more elaboration on this, there is the 
wonderful "Godel, Escher, Bach, an Eternal Golden Braid",
by Douglas Hofstader), and in conversation, a person will
avoid hogging more than one of his listener's context frames.
To do so is bad form in speaking and in writing, and this limitation 
plays a role in how our standards of writing and oration are shaped.
</p>

<p>

Painful as it may be to some, let's go back to that day in 
math class. You're taking your teacher's dictation on the 
blackboard, and there's a pair of parentheses. Your teacher
might have just said "open parem, blah blah blah, close parem...",
but for our purposes that's cheating. My 9th grade teacher would 
say "the quantity x plus y, now over z...", letting "the quantity"
and "now" indicate motion on the stack. When we have to use our
listener's stack in conversation, we might indicate such motion with hand
gestures.
</p>

<p>

So, now, what do we do with parenthetical expressions in C code?
We could replace each opening parem with "the quantity" and closing
parem with "now", but what about nested parentheses? Repetitions of 
"the quantity" or of "now" are simply unacceptable for our purposes
because we would never use them in writings or in speech. In class, 
we would raraely open or close consecutive parems, and rarely
nest umpteen pairs of them. So here we have to spell out 
when we open several layers of them. Here's how:
</p>

<p>
<pre>
primary_expression :  
    '('
    expression
    ')' 
{
    my $expression = $item{expression} ; 
    my $repeats = 1 ; 
    my $ending = 1 ; 
# We use these variables to keep track of layer numbers.
# If we have an expression that is already nested in the front,
# we remove the nesting.
    if ($expression =~  /^the (\\d+)-layered parenthetical expression/) { 
        $repeats= $1 +1 ; 
        $expression =~ s/^the \\d+-layered parenthetical expression //;
# If we have to start the nesting, we do this:
    } elsif ($expression =~  /^the parenthetical expression/) { 
        $repeats =2 ; 
        $expression =~ s/^the \\d+-layered parenthetical expression //;
    } 
# So for now the internal parems are gone.
# Now, to the rear of our expression:
    if ($expression =~ / now$/) { 
        $ending ++; 
        $expression =~ s/ now$//; 
        $expression .= " (now drop $ending layers of context)" ; 
    } elsif ($expression =~ /now drop (\\d+) layers of context\)$/ ) { 
        $ending =~ $1 +1; 
        $expression =~ s/\\d+ layers of context\)$/$ending layers of context \)/
; 
    } else { $expression .= ' now'; } 
# Finally, we wrap the expression in our pair of parems:
    if ($repeats > 1) { 
        $return =
            "the $repeats-layered parenthetical expression $expression"; 
    } else { 
        $return =
            "the parenthetical expression $expression"; 
    }
# And one more detail: if we're closing the parentheses at the very
# end of the C statement, we don't need to bother with the word "now."
    if ($text =~ /^;/) {
        $return =~ s/ now$//;
    } 
}
</pre>
</p>

<p>

So now, when we start a set of nested parems, we say "the 22-layered
parenthetical expression" (I'll take a poll on whether "the 22-layered
quantity" makes for better style), and if we close multiple sets, 
we say so: "(now drop 22 layers of context)". Since such a statement
does interrupt the flow of expression, it's good form to put it in,
yes, parentheses.
</p>

<p>

So, the c2eng whole script (1400 lines) is available 
<a href="http://www.mit.edu/~ocschwar/c2eng">
http://www.mit.edu/~ocschwar/c2eng</a>
</p>

<p>

 There are still many extant issues. When the C compiler
sets about to do its work, it first strips out the comments and
runs the C preprocessor. We don't. This means comments and
preprocessor directives could interrupt the C code anywhere
and in any context. The script is limited in its ability to handle that,
at the moment, (it can handle CPP directives and comments only 
between statements) but I am working on it. A run against a certain
piece of controversial C code has succeeded. Finally, there's the issue
of translating C back to English for the purpose of the demonstration.
Since the script spells things out very explicitly, the eng2c reverse script 
should be easier to write than the c2eng, but writing it may have to wait
until c2eng reaches the 1.0 stage.
<p>

</p>
Which brings us to close with the question of why this script
was written. DeCSS is a program that was written to allow 
people to play DVD movies on their Linux machines. It 
decrypts the CSS (Content Scrambling System) encoding 
which then allows for the content to be displayed.  
The program was written by a reverse-engineering endeavor
by amateurs on the net, much to the consternation of the 
DVD-Copy Control Association. 
These hobbyists are spread all over the world,
adding to the legal complications in the case. The DVD-CCA
has started two lawsuits thus far, and one of these recently 
concluded. In a lawsuit in federal court in New York,
Emannuel Goldstein, the editor of 2600 magazine, was 
barred from hosting the source code to DeCSS and, most amazingly,
from linking to other sites that might provide it. 
In his ruling, Judge Kaplan does not deny that 
source code is a form of expression, and he acknowledges
the continuum from "idea to human language to source code 
to object code". And he acknowledges the decreasing number of people
who can understand an idea as one mvoes from human language down 
to object code.
<p>

</p>

This is where I have to protest Kaplan's decision to bar
distribution of DeCSS, and I will explain this with an analogy. 
If I buy an airplane, all I could do with it is show it off in the 
backyard. I have never flown a plane, and if I tried, my next-door 
neighbors would be very upset when the flaming wreckage punctures
the inflatable wading pool in their backyard. This is why by law
I cannot not fly until I obtain a pilot's license. This is one case
where we all can buy and own something, but cannot use it fully
until the government is confident in our ability to leave
people's wading pools undamaged. 
<p>

</p>

A computer is not an airplane or a semitrailer. 
As much as the computer programming profession pays, and
as much as it might some day be revered, this trade does
not merit a legally enshrined priesthood. Any person with 
enough skill can download the legally-untouchable 
descriptions of the CSS reverse engineering efforts, 
and with that write a version of DeCSS. That person would
not, however, be allowed to share the same idea in executable
form with those not similarly skilled. Beside Kaplan's far-too
dismissive handling of the free speech issues involved, 
he makes a bad precedent by establishing a legal privilege
in a profession that should not have one. He justifies his
action in claiming that there is enough of a compelling 
state interest in this case. It is my hope that the 
in appeals in the DeCSS litigation, the appelate judges will
deem the state interest as not compelling enough, rather than
supress the distribution of what will demonstrably be
simple English sentences. 
<p>

</p>

There are other uses to the script. It is a useful demonstration
of what a program is doing, and thus an instruction tool, 
<p>

</p>

After writing the forward script, it 
came time to write the reverse. To do so I had to set
a few ground rules:
<p>

</p>

1. Do not, in any part of the code, use whitespace
as a parsing token. Output from the forward script should
withstand word wrapping without breaking its legibility to the
reverse script, meaning that the tab or the carriage return
should be just as good as the whitespace as a token separator. 
<p>

</p>

Writing the rules for the reverse script meant having to
take the strings I hardcoded into the forward script and
interspersing them with quotes, i.e. s/ /' '/g.
For example, the forward script says that inclusions are worded thus:
<pre>
{
      $return = "This program makes use of the system file $item{filename}.\n" ;  
}
</pre>
The reverse rule, then, must work like this:
<pre>
inclusion : 
    'This' 'program' 'makes' 'use' 'of' 'the' 'system' 'file'
    filename '.'
{
    $return =
        "\n#include <$item{filename}>\n"; 
}
</pre>
<p>

</p>

Modularizing the two scripts to allow for more creative
conversions will be more difficult because of this restriction,
but the enterprising neurolinguistic hacker will write
a quick s/ /' '/ pipe and solve the problem thus.
<p>

</p>

2. Whenever possible, copy the forward script's
grammar into the reverse script. This rule is 
easier to follow than to violate, and so is a good
rule to remember! 
<p>

</p>

<& footers.comp &>