c2eng, the How To, Part 2

For something more interesting on the English side of things, let's look at our variable declarations (trimming away much detail).


declaration :
    declaration_specifiers init_declarator_list(?) ';'

'declaration_specifiers' is where we find out the datatype. 'init_declarator_list' is where we list the variables we're declaring, and figure out initializations, whether some of these are pointers, et cetera.


init_declarator_list :  
{
    my @init_decl_list = @{$item[1]}; 
# Once again, we dereference to get the array the  
# construct gives us. There could be 1, 2, or more
# declarators. Each one we have to handle differently. 
    my $last = ''; 

    if ($#init_decl_list > 1) {
# To avoid calling up the ghost of E.B. White, 
# in the plural case we have to separate the declarators
# by commas (funny how English and C share this requirement), and
# add the word "and" before the last one:
        $last = pop @init_decl_list; 
        $return = 
           'the variables ' . join(', ', @init_decl_list) . ', and ' . $last;
    } elsif ( $#init_decl_list == 1 ) { 
# In the dual case, we don't have to use commas:
        $return = 
           'the variables ' .  
           $init_decl_list[0] . ' and ' .$init_decl_list[1];
    } else { 
# And in the singular case things are easy:
        $return = 'the variable ' . $init_decl_list[0]; 
    }
}
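
The three cases in that action can be tried outside the grammar. What follows is a minimal sketch in plain Perl, not an excerpt from c2eng: the helper name english_list is made up for illustration, and it assumes it is handed at least one already-quoted variable name.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of c2eng): mirrors the singular, dual,
# and plural cases of the init_declarator_list action.
sub english_list {
    my @names = @_;
    die "need at least one name\n" unless @names;
    # Singular: no list punctuation at all.
    return 'the variable ' . $names[0] if @names == 1;
    # Dual: just "and", no commas.
    return 'the variables ' . join(' and ', @names) if @names == 2;
    # Plural: commas between items, and "and" before the last one.
    my $last = pop @names;
    return 'the variables ' . join(', ', @names) . ', and ' . $last;
}

print english_list("'o_lfsr0'", "'o_lfsr1'"), "\n";
print english_list("'lfsr1_lo'", "'lfsr1_hi'", "'lfsr0'", "'combined'"), "\n";
```

The second call prints the same variable list as the example sentence for the four-variable declaration.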

Now, for the declaration return value, we can say


{
    $return = 
        "Specifying the type $item{declaration_specifiers}, allocate "; 
# The key for a repeated subrule includes its repetition specifier.
    $return .= join('', @{$item{'init_declarator_list(?)'}}) . ".\n"; 
}

Now we're ready to turn this :


        unsigned int lfsr1_lo,lfsr1_hi,lfsr0,combined;
        byte o_lfsr0, o_lfsr1;

to this : Specifying the type unsigned integer, allocate the variables 'lfsr1_lo', 'lfsr1_hi', 'lfsr0', and 'combined'. Specifying the type byte, allocate the variables 'o_lfsr0' and 'o_lfsr1'.

There are a few more tricks we need to use for this task. For example, we could parse an assignment expression successfully as a complete statement or as part of one, and the two cases need different wording. This is where the %arg hash rescues us. We write the rule for an assignment operator this way:


assignment_operator : 
    '=' 
{
    if ($arg{context} eq 'statement') { 
        $return = ['Assign to', ' the value' ] ; 
    } else { 
        $return = ', which is assigned to be '; 
    }
}

$arg{context} is a scalar that the assignment_operator is given from its parent rule. The parent rule assigns it this way:


assignment_expression : 
    unary_expression[context => $arg{context}] 
    assignment_operator[context => $arg{context}] 
    assignment_expression[context =>  'assignment_expression']

The bracketed sets define the %arg hash that each subrule is given. Note how in the first two sets we are just letting the subrules inherit the same $arg{context}, while the last one gets a new value. Now to figure out the $return for the assignment expression:


    if ($arg{context} eq 'statement' ) { 
# In the statement case the operator returned a two-element array ref.
        $return = 
           "${$item{assignment_operator}}[0] "
           ."$item{unary_expression}${$item{assignment_operator}}[1] "
           ."\"$item{assignment_expression}\".\n";
    } else {
        $return = 
          "$item{unary_expression}, ".
          "$item{assignment_operator} $item{assignment_expression}"; 
    }

So this way the C statement "foo = bar = baz;" gets translated as "Assign to 'foo' the value 'bar' which is assigned to be 'baz'."
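
To see the context hand-off without the grammar around it, here is a minimal sketch in plain Perl. It is not the c2eng code: verbalize_assignment is a hypothetical function whose first argument plays the role of $arg{context}, and the chained assignment is assumed to arrive as a flat list of names.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the assignment_expression rule. $context
# plays the role of $arg{context}; @chain holds the operands of a
# chained assignment such as foo = bar = baz, left to right.
sub verbalize_assignment {
    my ($context, @chain) = @_;
    my $target = shift @chain;
    return "'$target'" unless @chain;    # nothing left to assign

    # The right-hand side always gets the nested context, just as the
    # last subrule in the grammar rule does.
    my $rest = verbalize_assignment('assignment_expression', @chain);

    if ($context eq 'statement') {
        return "Assign to '$target' the value $rest.";
    }
    return "'$target', which is assigned to be $rest";
}

print verbalize_assignment('statement', qw(foo bar baz)), "\n";
```

Only the outermost call sees 'statement', so only the outermost assignment gets the imperative "Assign to" wording; every nested one becomes a trailing clause.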

Another issue that comes up is when we need to know what comes after a rule we've parsed. For example, when we use '-' as a unary operator (e.g. "foo = -bar; baz = -1;"), we need to know what comes afterwards because standard usage in seventh grade is "minus b" but "negative one". One way to solve this is to do this in the rule for a unary operator:


unary_operator : [ other things] 
	| '-' ...constant {$return  = 'negative ';}
	| '-' {$return = 'minus ';}

What happens is that while the parsing is going on, RecDescent is keeping track of a variable called $text, which contains the parts of the remaining text that have not yet been parsed. By preceding 'constant' with an ellipsis, we get RecDescent to look for a constant expression but not remove the constant from $text. We could also go into our block of Perl commands and do "if ($text =~ ..." with an appropriate regular expression. This way "-1" gets translated to "negative one" while "-x" gets translated to "minus x".
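
The regular-expression alternative mentioned above can be sketched in plain Perl. This is an illustration under assumptions, not the c2eng rule: verbalize_minus is a made-up helper, its argument stands in for RecDescent's $text, and only numeric constants are recognized.

```perl
use strict;
use warnings;

# Hypothetical helper: $text stands for the unparsed remainder that
# follows the '-' sign, the way RecDescent's $text would.
sub verbalize_minus {
    my ($text) = @_;
    # Peek without consuming anything: a digit, optionally preceded by
    # a decimal point, is taken to start a numeric constant.
    return $text =~ /^\s*\.?\d/ ? 'negative ' : 'minus ';
}

print verbalize_minus('1;'),   "\n";
print verbalize_minus('bar;'), "\n";
```

Because the match only inspects the string and never modifies it, the remainder is still there for the next rule to parse, which is the same effect the ellipsis lookahead has.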

Having figured out that the correct way to verbalize mathematical expressions is to spell them out in the drone of a math teacher, and that this lets us make our script work faster since we can cut down the number of rules in that series from ten to two, we've taken care of the hardest issues. Compared to that, it's easy to verbalize the flow control statements in C, and the function definitions. The simpler expressions are, well, simpler. It's easy to translate "foo(bar);" to "Perform the function 'foo' as applied to the argument 'bar'." In fact, to do so requires no more tricks than what we've done in the examples above. But there is another issue we have to cover, which is the parsing of parentheses in an expression.

In mathematics, parentheses allow us to bypass the rules of operator precedence. Unfortunately, parens don't translate well to human discourse, because what they conceptually do is set up a context stack. Our minds do implement a stack, which lets our attention change contexts as we deal with various aspects of our lives. But that stack is limited to 6 frames of context (for more elaboration on this, there is the wonderful "Gödel, Escher, Bach: an Eternal Golden Braid", by Douglas Hofstadter), and in conversation, a person will avoid hogging more than one of his listener's context frames. To do so is bad form in speaking and in writing, and this limitation plays a role in how our standards of writing and oration are shaped.

Painful as it may be to some, let's go back to that day in math class. You're taking your teacher's dictation on the blackboard, and there's a pair of parentheses. Your teacher might have just said "open paren, blah blah blah, close paren...", but for our purposes that's cheating. My 9th grade teacher would say "the quantity x plus y, now over z...", letting "the quantity" and "now" indicate motion on the stack. When we have to use our listener's stack in conversation, we might indicate such motion with hand gestures.

So, now, what do we do with parenthetical expressions in C code? We could replace each opening paren with "the quantity" and each closing paren with "now", but what about nested parentheses? Repetitions of "the quantity" or of "now" are simply unacceptable for our purposes because we would never use them in writing or in speech. In class, we would rarely open or close consecutive parens, and rarely nest umpteen pairs of them. So here we have to spell out when we open several layers of them. Here's how:


primary_expression :  
    '('
    expression
    ')' 
{
    my $expression = $item{expression} ; 
    my $repeats = 1 ; 
    my $ending = 1 ; 
# We use these variables to keep track of layer numbers.
# If we have an expression that is already nested in the front,
# we remove the nesting.
    if ($expression =~  /^the (\d+)-layered parenthetical expression/) { 
        $repeats= $1 +1 ; 
        $expression =~ s/^the \d+-layered parenthetical expression //;
# If we have to start the nesting, we do this:
    } elsif ($expression =~  /^the parenthetical expression/) { 
        $repeats =2 ; 
        $expression =~ s/^the parenthetical expression //;
    } 
# So for now the internal parens are gone.
# Now, to the rear of our expression:
    if ($expression =~ / now$/) { 
        $ending ++; 
        $expression =~ s/ now$//; 
        $expression .= " (now drop $ending layers of context)" ; 
    } elsif ($expression =~ /now drop (\d+) layers of context\)$/ ) { 
        $ending = $1 +1; 
        $expression =~ 
            s/\d+ layers of context\)$/$ending layers of context)/; 
    } else { $expression .= ' now'; } 
# Finally, we wrap the expression in our pair of parens:
    if ($repeats > 1) { 
        $return =
            "the $repeats-layered parenthetical expression $expression"; 
    } else { 
        $return =
            "the parenthetical expression $expression"; 
    }
# And one more detail: if we're closing the parentheses at the very
# end of the C statement, we don't need to bother with the word "now."
    if ($text =~ /^;/) {
        $return =~ s/ now$//;
    } 
}

So now, when we start a set of nested parens, we say "the 22-layered parenthetical expression" (I'll take a poll on whether "the 22-layered quantity" makes for better style), and if we close multiple sets, we say so: "(now drop 22 layers of context)". Since such a statement does interrupt the flow of expression, it's good form to put it in, yes, parentheses.
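
To watch the layer counting at work, the same bookkeeping can be run as an ordinary function. The following is a minimal sketch in plain Perl, assuming a hypothetical wrap_parens helper that takes the English for whatever sits between one pair of parentheses; it leaves out the end-of-statement check on $text.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the layer bookkeeping. $inner is
# the English for what sits between one pair of parentheses.
sub wrap_parens {
    my ($inner) = @_;
    my $repeats = 1;

    # Front: absorb an opening phrase the inner expression already has.
    if ($inner =~ s/^the (\d+)-layered parenthetical expression //) {
        $repeats = $1 + 1;
    } elsif ($inner =~ s/^the parenthetical expression //) {
        $repeats = 2;
    }

    # Rear: merge the closing markers into a single layer count.
    if ($inner =~ s/ now$//) {
        $inner .= ' (now drop 2 layers of context)';
    } elsif ($inner =~ /now drop (\d+) layers of context\)$/) {
        my $ending = $1 + 1;
        $inner =~ s/\d+ layers of context\)$/$ending layers of context)/;
    } else {
        $inner .= ' now';
    }

    return $repeats > 1
        ? "the $repeats-layered parenthetical expression $inner"
        : "the parenthetical expression $inner";
}

my $one   = wrap_parens('x plus y');    # (x + y)
my $two   = wrap_parens($one);          # ((x + y))
my $three = wrap_parens($two);          # (((x + y)))
print "$_\n" for $one, $two, $three;
```

Each extra pair of parentheses bumps both counters instead of repeating "the parenthetical expression" or "now", which is the point of the exercise.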

The whole c2eng script (1400 lines) is available at http://www.mit.edu/~ocschwar/c2eng

There are still many open issues. When the C compiler sets about its work, it first strips out the comments and runs the C preprocessor. We don't. This means comments and preprocessor directives could interrupt the C code anywhere and in any context. At the moment the script is limited in its ability to handle that (it can handle CPP directives and comments only between statements), but I am working on it. A run against a certain piece of controversial C code has succeeded. Finally, there's the issue of translating the English back to C for the purpose of the demonstration.

Which brings us to close with the question of why this script was written. DeCSS is a program that was written to allow people to play DVD movies on their Linux machines. It decrypts the CSS (Content Scrambling System) encoding, which then allows the content to be displayed. The program came out of a reverse-engineering effort by amateurs on the net, much to the consternation of the DVD-Copy Control Association. These hobbyists are spread all over the world, adding to the legal complications in the case. The DVD-CCA has started two lawsuits thus far, and one of these recently concluded. In a lawsuit in federal court in New York, Emmanuel Goldstein, the editor of 2600 magazine, was barred from hosting the source code to DeCSS and, most amazingly, from linking to other sites that might provide it. In his ruling, Judge Kaplan does not deny that source code is a form of expression, and he acknowledges the continuum from "idea to human language to source code to object code". He also acknowledges that fewer and fewer people can understand an idea as one moves from human language down to object code.

This is where I have to protest Kaplan's decision to bar distribution of DeCSS, and I will explain this with an analogy. If I buy an airplane, all I can do with it is show it off in the backyard. I have never flown a plane, and if I tried, my next-door neighbors would be very upset when the flaming wreckage punctured the inflatable wading pool in their backyard. This is why by law I cannot fly until I obtain a pilot's license. This is one case where we all can buy and own something, but cannot use it fully until the government is confident in our ability to leave people's wading pools undamaged.

A computer is not an airplane or a semitrailer. As much as the computer programming profession pays, and as much as it might some day be revered, this trade does not merit a legally enshrined priesthood. Any person with enough skill can download the legally untouchable descriptions of the CSS reverse engineering efforts, and with that write a version of DeCSS. That person would not, however, be allowed to share the same idea in executable form with those not similarly skilled. Besides Kaplan's far too dismissive handling of the free speech issues involved, he sets a bad precedent by establishing a legal privilege in a profession that should not have one. He justifies his action by claiming that there is enough of a compelling state interest in this case. It is my hope that in the appeals in the DeCSS litigation, the appellate judges will deem the state interest not compelling enough, rather than suppress the distribution of what will demonstrably be simple English sentences.

There are other uses for the script. It is a useful demonstration of what a program is doing, and thus an instruction tool.

After writing the forward script, it came time to write the reverse. To do so I had to set a few ground rules:

1. Do not, in any part of the code, use whitespace as a parsing token. Output from the forward script should withstand word wrapping without breaking its legibility to the reverse script, meaning that a tab or a carriage return should be just as good as a space as a token separator.

Writing the rules for the reverse script meant having to take the strings I hardcoded into the forward script and interspersing them with quotes, i.e. s/ /' '/g. For example, the forward script says that inclusions are worded thus:


{
      $return = 
         "This program makes use of the system file $item{filename}.\n" ;  
}

The reverse rule, then, must work like this:


inclusion : 
    'This' 'program' 'makes' 'use' 'of' 'the' 'system' 'file'
    filename '.'
{
    $return =
        "\n#include <$item{filename}>\n"; 
}

Modularizing the two scripts to allow for more creative conversions will be more difficult because of this restriction, but the enterprising neurolinguistic hacker will write a quick s/ /' '/ pipe and solve the problem thus.
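
Such a pipe might look like the sketch below. It is an illustration only: phrase_to_tokens is a made-up name, and the phrase is assumed to contain no single quotes, so no escaping is attempted.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a phrase hardcoded in the forward script
# into a run of quoted literals for the reverse grammar.
sub phrase_to_tokens {
    my ($phrase) = @_;
    # split ' ' treats any run of spaces, tabs, or newlines as one
    # separator, so word-wrapped input yields the same tokens.
    my @words = split ' ', $phrase;
    return join ' ', map { "'$_'" } @words;
}

print phrase_to_tokens('This program makes use of the system file'), "\n";
```

The printed line is the literal sequence that opens the inclusion rule above. Splitting on any whitespace rather than on single spaces keeps the helper in line with the first ground rule.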

2. Whenever possible, copy the forward script's grammar into the reverse script. This rule is easier to follow than to violate, and so is a good rule to remember!

Omri Schwarz, March 9, 2001