<& headers.test , 'title' => 'c2eng, the How To' , sidebar_subtopic => 'perl:RecDescent' &>

This is the pre-edited version of my article in the Perl Journal.

I've been loosely following the litigation regarding the DeCSS code (more on that later) and I am interested in the issue of whether computer source code should be regarded as a form of protected expression. I side firmly in the affirmative, so I set out to write a program that takes ANSI-C code and translates it to gramatically correct descriptive English sentences. That means taking the C preprocessing statement: "#include " and turning it into "This program makes use of the system file math.h." Or turning "#define Pi 3.14159265358979" and turning it into "Note: we define Pi to mean '3.14159265358979'" The global declaration "int I,J,K;" becomes "Specifying the type integer, allocate the variables 'I', 'J', and 'K'."

And so on and so forth. Needless to say, the tool I chose for this task is Perl. I could have tried with yacc, but I knew more Perl than yacc.

The first thing I had to do was ask around Usenet for suggestions as to how to go about go about this task. Several folks pointed me in the direction of the Parse::RecDescent module. Having downloaded the module and podfile, I looked up more about yacc and got cozy with the C grammar at the appendix of Kernighan and Ritchie Second Edition. I also looked at the Parse::Yapp pod file. Unfortunately, I hopped around these sources inconsistently, because of the special constraints offered by the problem.

I concluded (and still stand by the conclusion) that Parse::RecDescent is te best tool for the task. My approach was to start with a loosely altered version of the K&R grammar, and run it against a sample of C code, modifying as I went along. That was a singularly bad idea. I'll get back to why in a little bit, but this is where we should come to an intro to RecDescent.

Parse::RecDescent offers several advantages over other parsers for this task, and here is why: a typical entry in a grammar looks like this:

	Rule : Subrule1 Subrule2 
	Subrule1 : Subrule3 
But with Parse::RecDescent, you can modify the rule to do this:
	Rule : Subrule1 Subrule2 
	{ $return = $item{subrule2} . $item{subrule1}; } 
This lets us assemble a return string ($return) in any Perly way we wish and return it to the superrule that called Rule. This was exactly the flexibility that was needed here. It lets us throw in various strings like "specified as the type" in the right places to turn a C statement into an English one.

Now, at the highest level of parsing in our case, we can run into a global variable declaration, a function prototype or definition, a comment, or a preprocessor directive. So our first rule looks like this:

		
startrule : (
	preproc {print $item{preproc};}
	| comment {print $item{comment};}
	| global_var_declaration {print $item{global_var_declaration};} 
	| function_definition {print $item{function_definition};}
	| function_prototype {print $item{function_prototype};}
)(s)
This tells RecDescent to collect the return string from whatever parses correctly, and then just print it out. We can only do the print statements at the top level because during the parsing we will parse many things successfully as part of unsucessful parsings. The parentheses around the subrules are here to say that we will find more than one "startrule" in our input text.

By this point I had posted several questions to comp.lang.perl.misc and gotten assistance from Damian Conway and Randal Schwartz, (by 'assistance' I mean being walked by the hand through a lot of issues). So I posted one more time and Damian sent me a C grammar that had been written for RecDescent. Since the grammar had been written without interruption by the things I was doing, it was far better for this purpose.

So if we weren't using recursive descent, we would have rules in our grammar looking like this:

additive_expression : 
	multiplicative_expression 
	| additive_expression '+' multiplicative_expression 
	| additive_expression '-' multiplicative_expression 

multiplicative_expression : 
	cast_expression 
	| multiplicative_expression '*' cast_expression 
	| multiplicative_expression '/' cast_expression 
	| multiplicative_expression '%' cast_expression 
The rule for a multiplicative expression is the last in a long series covering the logical, bitwise, and other operators. But RecDescent will refuse to work with the rules as we have them here. So, let's concentrace on the additive expression. To make the rule logically equivalent but not left-recursive,we have the feature, which speeds up the parsing. This rule is equivalent to the previous, but works faster:
additive_expression : 
	 
add_op : '-' | '+'
If life were simple, we'd stop here. But by translating the entire C grammar to this pattern we have a grammar that can be overwhelmed by the following statement from a legally-troubled piece of source code:
	o_lfsr0 = (((((((lfsr0>>8)^lfsr0)>>1)^lfsr0)>>3)^lfsr0)>>7);
Using this set of parentheses lets the instruction fly fast at run time. But our rule for an expression in parentheses is several layers below a multiplicative expression. By adding to the rules a print statement, I got to see what was hapening:
primary_expression : [other stuff] 
	| '(' expression 
# note we go all the way back to the top of the grammar here. 
	')' {print STDERR "$return\n";} 
The constant stream of text to standard error showed a beautiful pattern. For every twelve or so times our grammar parsed the innermost expression, it parsed the innermost two. For every twelve or so times it got the innermost two, it got the innermost three. Unfortunately, beauty is not our main goal here. Luckily, since we're not writing an actual C compiler, we have a little leeway. Think back to seventh grade. Your teacher is dictating a long equation and you write it on the board. While you're taking the dictation, you do not need to care about operator precedence. So, in this context, we don't either.

So after much cutting and pasting, the ten rules for operator expressions were condensed to two rules. Here's one rule:

rel_mul_add_ex_op :
	'+' {$return = ' plus ';}
	| '-' {$return = ' minus '; }
	| '*' {$return = ' times '; }
	| '/' {$return = ' divided by '; }
	| '%' {$return = ' modulo '; }
	|'<<' {$return = ' shifted left by ';}
	| '>>' {$return = ' shifted right by ';}
	| '>='  {$return = ' checked to be greater than or equal to ';}
	| "<="  {$return = ' checked to be less than or equal to ';}
	| '>' {$return = ' checked to be greater than ';}
	| '<'  {$return = ' checked to be less than ';}
	
rel_add_mul_shift_expression : 
	
So now's the time for another digression. In 1557 a scholar named Robert Recorde writing a treatise on arithmetic wrote this famous quote:
To avoide the tediouse repetition of these woordes: is equalle to: I will settle as I doe often in woorke use, a paire of paralleles, or gemowe [twin] lines of one length e: =, bicause noe .2. thynges, can be moare equalle.
Yes, mathematical notation, our "universal language", began with the impatience of a man writing an entire book by hand, in a dank unlighted room, with an inkwell and a quill. Since our mathematical symbols are actually contractions for normal English woordes, we can take any equation and construct a gramatically correct sentence with it.

For example, "the square of the hypotenuse is equal to the square of the first leg plus the square of the second leg." This should explain to you the return values I'm using in the rule for operators. Of course, we can only turn an equation into a sentence if its variables stand for something. Alternatively, we can redefine "English" to include nouns, verbs, adjectives, adverbs, prepositions, conjunctions, interjections, and variable names. Compared to what happens in corporate press releases, this is a far milder assault on the purity of the language.

Having figured out how to descend recursively through a left-associative grammar, and translate algebraic equations back to the monotone of a seventh grade teacher, we have to put in the Perl code to construct it. Enter %item, and $return. At the end of every rule we have to put in an Action, in which we assign a value to $return. The simplest case of this is above. Now, for the expression rule:

rel_add_mul_shift_expression : 
	
{
	$return = join('',@{$item[1]}); 
}
For the rule "foo : bar baz" we have the @item and %item variables with which to build $return. We can find out the $return value from the bar subrule by looking at $item{bar} or $item[1]. The baz rule gets $item[2] and $item{baz}. In the case of the rule above, we have only one subrule (the entire construct), and leftop is a special rule that returns a reference to an anonymous array containing the $return values for however many cast_expressions and operators get parsed. Since the hard work of building the English for the expression is done in the layers below, here we just dereference the array and join() it to a single string.

That turns this:

  prod[2][1] = a[2][1]*b[1][1]+a[2][2]*b[2][1];
to this: Assign to array prod's element at address (2,1) the value "array a's element at address (2,1) times array b's element at address (1,1) plus array a's element at address (2,2) times array b's element at address (2,1)".

For something more interesting on the English side of things, let's look at our variable declarations (trimming away much detail).

declaration :
    declaration_specifiers init_declarator_list(?) ';'
'declaration_specifiers' is where we find out the datatype. 'init_declarator' list is where we list the variables we're declaring, and figure out initializations, and whether some of these are pointers, et cetera.
init_declarator_list :  
{
    my @init_decl_list = @{$item[1]}; 
# Once again, we dereference to get the array the  
# construct gives us. There could be 1, 2, or more
# declarators. Each one we have to handle differently. 
    my $last = ''; 

    if ($#init_decl_list > 1) {
# To avoid calling up the ghost of E.B. White, 
# in the plural case we have to separate the declarators
# by commas (funny how English and C share this requirement), and
# add the word "and" before the last one:
        $last = pop @init_decl_list; 
        $return = 'the variables ' . join(', ', @init_decl_list) . ', and ' . $last;
    } elsif ( $#init_decl_list == 1 ) { 
# In the dual case, we don't have to use commas:
        $return = 'the variables ' .  $init_decl_list[0] . ' and ' .$init_decl_list[1];
    } else { 
# And in the singular case things are easy:
        $return = 'the variable ' . $init_decl_list[0]; 
    }
}
Now, for the declaration return value, we can say
{
	$return = "Specifying the type $item{declaration_specifiers}, allocate "; 
	$return .= join('',@{$item{init_declaration_list}}) . ".\n"; 
}
Now we're ready to turn this :
        unsigned int lfsr1_lo,lfsr1_hi,lfsr0,combined;
        byte o_lfsr0, o_lfsr1;
to this : Specifying the type unsigned integer, allocate the variables 'lfsr1_lo', 'lfsr1_hi', 'lfsr0', and 'combined'. Specifying the type byte, allocate the variables 'o_lfsr0' and 'o_lfsr1'.

There are a few more tricks we need to use for this task. For example, we could parse an assignment expression succesfully as a complete statement or as part of one. This is where the %arg hash rescues us. We write the rule for an assignment operator this way:

assignment_operator : 
    '=' 
{
    if ($arg{context} eq 'statement') { 
        $return = ['Assign to', ' the value' ] ; 
    } else { 
        $return = ', which is assigned to be '; 
    }
}
$arg{context} is a scalar that the assignment_operator is given from its parent rule. The parent rule assigns it this way:
assignment_expression : 
    unary_expression[context => $arg{context}] 
    assignment_operator[context => $arg{context}] 
    assignment_expression[context =>  'assignment_expression'] 
In the bracket sets is the definition of the %arg hashes each subrule is given. Note how in the first two sets we are just letting the subrules inherit the same $arg{context}, while the last case gets a new one. Now to figure out the $return for the assignment expression:
    if ($arg{context} eq 'statement' ) { 
        $return .= "${$item{assignment_operator}}[0] $item{unary_expression}${$item{assignment_operator}}
[1] \"$item{assignment_expression}\".\n";
    } else {
        $return = "$item{unary_expression}, $item{assignment_operator} $item{assignment_expression}"; 
    } 
So this way the C statement "foo = bar = baz;" gets translated as "Assign to 'foo' the value 'bar' which is assigned to be 'baz'."

Another issue that comes up is when we need to know what comes after a rule we've parsed. For example, when we use '-' as a unary operator (e.g. "foo = -bar; baz = -1;"), we need to know what comes afterwards because standard usage in seventh grade is "minus b" but "negative one". One way to solve this is to do this in the rule for a unary operator:

unary_operator : [ other things] 
	| '-' ...constant {$return  = 'negative ';}
	| '-' {$return = 'minus ';}
What happens is that while the parsing is going on, RecDescent is keeping track of a variable called $text, which contains the parts of the remaining text that have not yet been parsed. By preceding 'constant' with an ellipsis, we get RecDescent to look for a constant expression but not remove the constant from $text. We could also go into our block of Perl commands and do "if ($text =~ ..." with an appropriate regular expression. This way "-1" gets translated to "negative one" while "-x" gets translated to "minus x".

Having figured out that the correct way to verbalize mathematical expressions is to spell them out in the drone of a math teacher, and that this lets us make our script work faster since we can cut down the number of rules in that series from ten to two, we've taken care of the hardest issues. Compared to that, it's easy to verbalize the flow control statements in C, and the function definitions. The simpler expressions, are, well, simpler. It's easy to translate "foo(bar);" to "Perform the function 'foo' as applied to the argument 'bar'." In fact, to do so requires no more tricks than what we've done in the examples above. But there is another issue we have to cover, which the is parsing of parentheses in an expression.

In mathematics, parentheses allow us to bypass the rules of operator precedence. Unfortunately, parems don't translate well to human discourse, because what they conceptually do is set up a context stack. Our minds do implement a stack, which lets our attention change contexts as we deal with various aspects of our lives. But that stack is limited to 6 frames of context (for more elaboration on this, there is the wonderful "Godel, Escher, Bach, an Eternal Golden Braid", by Douglas Hofstader), and in conversation, a person will avoid hogging more than one of his listener's context frames. To do so is bad form in speaking and in writing, and this limitation plays a role in how our standards of writing and oration are shaped.

Painful as it may be to some, let's go back to that day in math class. You're taking your teacher's dictation on the blackboard, and there's a pair of parentheses. Your teacher might have just said "open parem, blah blah blah, close parem...", but for our purposes that's cheating. My 9th grade teacher would say "the quantity x plus y, now over z...", letting "the quantity" and "now" indicate motion on the stack. When we have to use our listener's stack in conversation, we might indicate such motion with hand gestures.

So, now, what do we do with parenthetical expressions in C code? We could replace each opening parem with "the quantity" and closing parem with "now", but what about nested parentheses? Repetitions of "the quantity" or of "now" are simply unacceptable for our purposes because we would never use them in writings or in speech. In class, we would raraely open or close consecutive parems, and rarely nest umpteen pairs of them. So here we have to spell out when we open several layers of them. Here's how:

primary_expression :  
    '('
    expression
    ')' 
{
    my $expression = $item{expression} ; 
    my $repeats = 1 ; 
    my $ending = 1 ; 
# We use these variables to keep track of layer numbers.
# If we have an expression that is already nested in the front,
# we remove the nesting.
    if ($expression =~  /^the (\\d+)-layered parenthetical expression/) { 
        $repeats= $1 +1 ; 
        $expression =~ s/^the \\d+-layered parenthetical expression //;
# If we have to start the nesting, we do this:
    } elsif ($expression =~  /^the parenthetical expression/) { 
        $repeats =2 ; 
        $expression =~ s/^the \\d+-layered parenthetical expression //;
    } 
# So for now the internal parems are gone.
# Now, to the rear of our expression:
    if ($expression =~ / now$/) { 
        $ending ++; 
        $expression =~ s/ now$//; 
        $expression .= " (now drop $ending layers of context)" ; 
    } elsif ($expression =~ /now drop (\\d+) layers of context\)$/ ) { 
        $ending =~ $1 +1; 
        $expression =~ s/\\d+ layers of context\)$/$ending layers of context \)/
; 
    } else { $expression .= ' now'; } 
# Finally, we wrap the expression in our pair of parems:
    if ($repeats > 1) { 
        $return =
            "the $repeats-layered parenthetical expression $expression"; 
    } else { 
        $return =
            "the parenthetical expression $expression"; 
    }
# And one more detail: if we're closing the parentheses at the very
# end of the C statement, we don't need to bother with the word "now."
    if ($text =~ /^;/) {
        $return =~ s/ now$//;
    } 
}

So now, when we start a set of nested parems, we say "the 22-layered parenthetical expression" (I'll take a poll on whether "the 22-layered quantity" makes for better style), and if we close multiple sets, we say so: "(now drop 22 layers of context)". Since such a statement does interrupt the flow of expression, it's good form to put it in, yes, parentheses.

So, the c2eng whole script (1400 lines) is available http://www.mit.edu/~ocschwar/c2eng

There are still many extant issues. When the C compiler sets about to do its work, it first strips out the comments and runs the C preprocessor. We don't. This means comments and preprocessor directives could interrupt the C code anywhere and in any context. The script is limited in its ability to handle that, at the moment, (it can handle CPP directives and comments only between statements) but I am working on it. A run against a certain piece of controversial C code has succeeded. Finally, there's the issue of translating C back to English for the purpose of the demonstration. Since the script spells things out very explicitly, the eng2c reverse script should be easier to write than the c2eng, but writing it may have to wait until c2eng reaches the 1.0 stage.

Which brings us to close with the question of why this script was written. DeCSS is a program that was written to allow people to play DVD movies on their Linux machines. It decrypts the CSS (Content Scrambling System) encoding which then allows for the content to be displayed. The program was written by a reverse-engineering endeavor by amateurs on the net, much to the consternation of the DVD-Copy Control Association. These hobbyists are spread all over the world, adding to the legal complications in the case. The DVD-CCA has started two lawsuits thus far, and one of these recently concluded. In a lawsuit in federal court in New York, Emannuel Goldstein, the editor of 2600 magazine, was barred from hosting the source code to DeCSS and, most amazingly, from linking to other sites that might provide it. In his ruling, Judge Kaplan does not deny that source code is a form of expression, and he acknowledges the continuum from "idea to human language to source code to object code". And he acknowledges the decreasing number of people who can understand an idea as one mvoes from human language down to object code.

This is where I have to protest Kaplan's decision to bar distribution of DeCSS, and I will explain this with an analogy. If I buy an airplane, all I could do with it is show it off in the backyard. I have never flown a plane, and if I tried, my next-door neighbors would be very upset when the flaming wreckage punctures the inflatable wading pool in their backyard. This is why by law I cannot not fly until I obtain a pilot's license. This is one case where we all can buy and own something, but cannot use it fully until the government is confident in our ability to leave people's wading pools undamaged.

A computer is not an airplane or a semitrailer. As much as the computer programming profession pays, and as much as it might some day be revered, this trade does not merit a legally enshrined priesthood. Any person with enough skill can download the legally-untouchable descriptions of the CSS reverse engineering efforts, and with that write a version of DeCSS. That person would not, however, be allowed to share the same idea in executable form with those not similarly skilled. Beside Kaplan's far-too dismissive handling of the free speech issues involved, he makes a bad precedent by establishing a legal privilege in a profession that should not have one. He justifies his action in claiming that there is enough of a compelling state interest in this case. It is my hope that the in appeals in the DeCSS litigation, the appelate judges will deem the state interest as not compelling enough, rather than supress the distribution of what will demonstrably be simple English sentences.

There are other uses to the script. It is a useful demonstration of what a program is doing, and thus an instruction tool,

After writing the forward script, it came time to write the reverse. To do so I had to set a few ground rules:

1. Do not, in any part of the code, use whitespace as a parsing token. Output from the forward script should withstand word wrapping without breaking its legibility to the reverse script, meaning that the tab or the carriage return should be just as good as the whitespace as a token separator.

Writing the rules for the reverse script meant having to take the strings I hardcoded into the forward script and interspersing them with quotes, i.e. s/ /' '/g. For example, the forward script says that inclusions are worded thus:
{
      $return = "This program makes use of the system file $item{filename}.\n" ;  
}
The reverse rule, then, must work like this:
inclusion : 
    'This' 'program' 'makes' 'use' 'of' 'the' 'system' 'file'
    filename '.'
{
    $return =
        "\n#include <$item{filename}>\n"; 
}

Modularizing the two scripts to allow for more creative conversions will be more difficult because of this restriction, but the enterprising neurolinguistic hacker will write a quick s/ /' '/ pipe and solve the problem thus.

2. Whenever possible, copy the forward script's grammar into the reverse script. This rule is easier to follow than to violate, and so is a good rule to remember!

<& footers.comp &>