A compiler is a program which reads instructions in the form of computer language written by a person and translates them into the machine code that a computer uses to execute. The computer language has a complex grammatical structure which must be parsed in order to construct the correct execution sequences and references. A special subprogram of the compiler performs this task. This is basically the first part of the compilers work, and this component of the compiler is called the parser. The parser has a set of language rules which it uses to decode the grammatical constructs that the programmer has written. The programmer and the parser both have to know the same rules. This makes the parser a very complex program. Actually too complex for anyone to write. In particular it is very hard to make changes to the parser if we want to change the language rules. So instead, we use another program, the parser generator, to write the parser.
Parser generators come in different flavours. This is because after the language is parsed we still have to do something with the parsed instructions. There are two choices. For a compiler, we generate the machine code instructions which our particular computer needs in order to execute the program. Alternatively we can execute the instructions as we parse them, in accordance with the input data. This is becoming a commonly used technique and such a program is called an interpreror. Writing compilers requires detailed knowledge of machine langue and is a highly speciased job done only within computer manufacturing companies these days. However interpretors can use other compiler languages or other interpreted languages to perform their instructions.
In this course we write a parser generator in the Tcl language. Compiler/parser generators exist written in C and C++ on virtually every hardware platform. The standard program used today is called Bison, another copmmonly used is Lemon. But for those who do not want to maintain a C/C++ development environment, but still want to play around with syntax and parsing, either from a teaching or an experimental viewpoint a Tcl parser generator is a useful program. As such it will only ever be a toy. When an industrial strength parser is required recourse must be made to Bison or Lemon . These programs are open source and available from GNU.
A parser generator is a complex program computationally. Indeed it is one of the more testing computational tasks, so this is not a trivial exercise. In order to cut the task into small manageable pieces we take the same approach as the early writers of compiler compilers. This is a bootstrap approach which assumes we may have to write the language in assembly code. Fortunately we dont have to do this. The process is split into two stages, following the early work of McKeeman, Horning and Wortman , "A Compiler Generator" (Prentice Hall 1970), probably long since out of print. In this model the syntax analyser reads the syntax rules and produces a table. The compiler consists of a scanner which uses the table to check the syntax of the program, together with an attached list of instructions saying what to do when each syntax element has been completely recognised. This can be a collection of user supplied procedures. In this model the syntax rules are seperated from the semantic rules which say how the syntax structure is to be interpreted.
Because the scanner is written in Tcl, we can easily supply Tcl action and integrate them into the scanner. This means that we cann create a computer language with an interpreter. Note that Tcl itself is not an example of such a languange. Tcl is so simple it does not use the analyser/scanner and does not need a parser. Tcl has only about 11 simple rules to its language. But we can use Tcl to build sophisticated computer languages.
Later we can look at defining a combined syntax and semantics definition language, but of course we need a compiler generator to do that, which we dont have yet. However Tcl provides some bascis to get started.
The first stage is to write the syntax analyser. Syntax is described by listing the rules of the syntax. There is a set of character sequences called terminals. These are the atoms of the language, and are what is seen on the written page. The structure of the language is described by assigning allowable groupings of terminals a more general name describing the grammatical import of the sequences of terminals. These more general names are called non-terminals. A rule is a sequence of terminals and non-terminals which represent an occurence of a general type of structure with its own non-terminal name. In a format historically used called BNF (Backus-Naur Form) the non-terminal is written on the left followed by the symbol ::= ( chosen so that it would not conflict with other combinations of special characters which might need to be used in a grammar ), followed by the sequence of terminals and non-terminals which are a permissable rule. Non-terminals are enclosed in chevrons <>, and often terminals are represented in bold case. The illustration above gives examples of two rules in this format. The +, the letter sequences "if" and "then" are terminals in this example of a language. You can have several rules describing one particular non-terminal, and a non-terminal can be an object of its own rule.
These are quite free rules and unfortunately are so free as to cause quite a few problems and ambiguities. The syntax analyser reports on these problems and the author of the language grammar strives to correct them and write an unambiguous grammar. Writing a language grammar is a special discipline and is quite unlike computer programming.
Although this project produces a general tool, we will follow through the project with a particular application. We have writte procedures for doing big-num arithmetic and may have employed these in a big-num integer calculator. But these procedures are not ready to use in a Tcl program itself, so here we provide a procedure similar to the Tcl [expr] command that accepts mathematical expressions that use big-num integers.
We trace the project through the stages in the diagram above.
First the syntax rules, then the syntax analyser,
then the syntax analyser which produces a specific syntax table for our integer expression
We look at the semantic rules
Then create the scanner that reads any integer expression and uses the syntax table and the semantic rules to output the proper result.