Documentation for ParseGen. [Revision 3] Copyright 1993, 1995 William Clinger Permission to copy this software, in whole or in part, to use this software for any lawful purpose, and to redistribute this software is granted subject to the restriction that all copies made of this software must include this copyright notice in full. I also request that you send me a copy of any improvements that you make to this software so that they may be incorporated within it to the benefit of the Scheme community. ParseGen is a strong LL(1) parser generator designed for use in education and research. It makes no attempt to convert a non-LL(1) grammar into LL(1) form; you have to do that yourself. Unlike most parser generators, it can generate a parser in any of several different languages. I have written code generators for Scheme, Pascal, C, and Java, and it should not be hard to modify these generators for other languages. The generated parsers use recursive descent, so they are easy to read and modify. These peculiar characteristics are what I want when I teach a course on compiler construction. Students must convert the grammar they are given into LL(1) form, write a lexical analyzer, design an intermediate representation, and write the action procedures that generate that representation, but are spared the trivial, tedious, and error-prone task of translating their LL(1) grammar into code. This directory contains the following files and subdirectories: README this file follow.sch computes director sets and tests LL(1) condition loadparsegen.sch loads the entire parser generator parsegen.c.sch generates C code parsegen.java.sch generates Java code parsegen.pascal.sch generates Pascal code parsegen.scheme.sch generates Scheme code parsegen.sch parser generator (target-independent part) parsegen0.pg grammar for inputs to ParseGen parsegen0.sch parser for inputs to ParseGen sets.sch operations on sets represented as lists parsegen0.pg is an example of the input to ParseGen, and parsegen0.sch contains, among other things, the parser generated from parsegen0.pg. The parser generator can be loaded into Scheme by loading loadparsegen.sch. There are four entry points, described below. If "mygrammar" is a file containing an LL(1) grammar in the format described below, then (generate-scheme "mygrammar" "myparser.sch" "tables") (generate-pascal "mygrammar" "myparser.pas" "tables") (generate-c "mygrammar" "myparser.c" "tables") (generate-java "mygrammar" "myparser.c" "tables") generates a recursive descent parser, written in IEEE/ANSI Scheme, ISO Pascal, ANSI C, or Java, respectively, writes the code for the parser to the file named by the second argument, and writes a table of symbolic names for tokens and a table of action procedures to the file named by the third argument. The output files named by the second and third arguments must not already exist. The third argument, or both the second and the third argument, may be omitted, in which case the current output port is used instead of a file. The generated parser communicates with a lexical analyzer, which must be supplied separately, through the following procedures: (next-token) returns the kind of the current lookahead token (consume-token!) consumes the lookahead token In Pascal, C, and Java, these procedures are named nextToken and consumeToken, in accord with the parser generator's general rules for translating Scheme identifiers into Pascal, C, and Java. In Scheme, the generated parser assumes that each token is represented as a symbol. In Pascal, C, or Java, the parser generator generates declarations for three data types: token, tokens, and nonterminal. In Pascal and C, the parser represents a token or nonterminal as an enumerated value; in Java, an int constant is used instead. Input to ParseGen. The input to ParseGen is very similar to the input accepted by the LLGen parser generator that is described in Appendix C of "Crafting a Compiler with C" by Fischer and LeBlanc Jr. Most inputs that are accepted by LLGen will also be accepted by ParseGen, but ParseGen will ignore the *fmq and *define sections of the input to LLGen. The input is an LL(1) grammar in which each production is augmented by the name of a procedure that will perform any actions needed for that production and then return a synthesized value of type ast (which can be any type). The arguments to the action procedure will be the values synthesized for the nonterminals that occur on the right hand side of the production. These arguments can be suppressed by evaluating (actions-take-arguments #f) before generating the parser. Tokens with attributes should be written as nonterminals in the grammar, with one production whose right hand side is a single terminal, and the action procedure associated with that production should collect the attributes through cooperation with the lexical analyzer. The input grammar must be written in the following form: *terminals *productions *end Each terminal must be named in the *terminals section. If the name of the terminal is a reasonable identifier, then ParseGen will use it as it stands (Scheme) or prefix it with a z (C, Pascal, Java) to construct the symbol or enumerated value that must be returned for that terminal by the lexical analyzer. Special characters may also be used to name terminals, in which case ParseGen will choose the symbol or enumerated value itself. For example, ParseGen might choose the enumerated value "zassign" for a terminal named ":=", and the enumerated value "z97" for a terminal named "&&". The left hand side of the first production is assumed to be the start symbol. Each production should be written on a separate line of the form ::= # where is a single nonterminal, is a sequence of terminals and nonterminals, and is the name of the action procedure that will be called for that production. The may be omitted for all but the first production, in which case it is assumed to be the same as for the previous production. If N is an unsigned integer, then #N is equivalent to #action-N. The # may be omitted, in which case ParseGen will invent a name for the action procedure. For convenience, an action procedure may also be specified in the middle of a production. For example, A ::= array of #mk-type #mk-arraydecl is equivalent to A ::= B #mk-arraydecl B ::= array of #mk-type where B is a new nonterminal. Here is a complete example of an input grammar for ParseGen: Language of while loops. *terminals := ; ( ) addop begin do else end id if intcon mulop relop skip then while *productions ::= := #mk-assignment ::= skip #mk-skip ::= if then else #mk-if ::= while do #mk-while ::= begin end #cons ::= #mk-empty ::= ; #cons ::= #mk-boolexpr ::= E #identity E ::= T E2 #leftassociate E2 ::= #mk-none ::= T E2 #mk-partial T ::= F T2 #leftassociate T2 ::= #mk-none ::= F T2 #mk-partial F ::= #identity ::= #identity ::= ( E ) #identity ::= intcon #mk-intcon ::= id #mk-id ::= relop #mk-op ::= addop #mk-op ::= mulop #mk-op *end The names of the nonterminals and action procedures should be acceptable as identifiers to all programming languages for which ParseGen can generate code, except that: * Hyphenated identifiers, which are acceptable to Scheme, will be translated into most other languages by dropping the hyphen and capitalizing the letter that follows the hyphen. * Nonterminals and action procedures may be surrounded by angle brackets, as in . Such identifiers may contain spaces. When translated, the angle brackets will be dropped and the spaces will be replaced by underscores. Outputs from ParseGen. The file specified by the second argument will contain code for a recursive descent parser, together with declarations of the token, tokens, and nonterminal data types and a set of declarations and stubs for the action procedures. To obtain a complete parser, you must add any other type and data declarations that are necessary, a lexical analyzer, code for the action procedures, and a parse-error procedure that takes as arguments the nonterminal being parsed and the tokens that were expected but not found. A table of tokens and their symbolic names is written to the file specified by the third argument. This table specifies the symbolic names that the lexical analyzer must use to communicate the current token to the parser. A table listing the action procedures and their arguments is also written to the file specified by the third argument. Internal representations of grammars. The input to ParseGen is converted into a Scheme list of entries, one for each nonterminal. Nonterminals are represented by symbols, terminals by strings. Each entry is a list whose car is a nonterminal and whose cdr is a list of augmented productions for that nonterminal. Each augmented production is a list whose car is the right hand side of a production and whose cadr is the name of the action procedure for that production. Each right hand side is a list of terminals and nonterminals. For example, the entry for the nonterminal in the language of while loops might be represented as ( (( ":=" ) mk-assignment) (("skip") mk-skip) (("if" "then" "else" ) mk-if) (("while" "do" ) mk-while) (("(" ")") identity) (("begin" "end") identity)) The action procedure for a production may be the symbol *, in which case the right hand side of the production must consist of a single terminal and the production must be the only production for the nonterminal on its left hand side. In this case, the nonterminal is taken to be an alias for the terminal. For example, the above productions are equivalent to ( (( := ) mk-assignment) ((skip) mk-skip) ((if then else ) mk-if) ((while do ) mk-while) ((lparen rparen) identity) ((begin end) identity)) (:= ((":=") *)) (if (("if") *)) (then (("then") *)) (else (("else") *)) (while (("while") *)) (do (("do") *)) (lparen (("(") *)) (rparen ((")") *)) (begin (("begin") *)) (end (("end") *)) The first entry in the grammar is for the start symbol.