A Tcl Parser Generator

written July 2001 being revised May 2006

Google site search


1   Introduction

In this course we write a compiler generator in the Tcl language. Compiler generators exist written in C and C++ on virtually every hardware platform. The standard program used today in Bison. But for those who do not want to maintain a C/C++ development environment, but still want to play around with syntax and parsing, either from a teaching or an experimental viewpoint a Tcl compiler generator is a useful program. As such it will only ever be a toy. When an industrial strength compiler is required recourse must be made to Bison.

A compiler generator is a very complex program computationally. Indeed it is one of the more testing computational tasks, so this is not a trivial exercise. In order to cut the task into small manageable pieces we take the same approach as the early writers of compiler compilers. This is a bootstrap approach which assumes we may have to write the language in assembly code. Fortunately we dont have to do this. The process is split into two stages, following the early work of McKeeman, Horning and Wortman , "A Compiler Generator" (Prentice Hall 1970), probably long since out of print. In this model the syntax analyser reads the syntax rules and outputs a table. The compiler consists of a scanner which uses the table to check the syntax of the program, together with an attached list of instructions saying what to do when each syntax element has been completely recognised. This can be a collection of user supplied procedures. In this model the syntax rules are seperated from the semantic rules which say how the syntax structure is to be interpreted.

Later we can look at defining a combined syntax and semantics definition language, but of course we need a compiler generator to do that, which we dont have yet. However Tcl provides some bascis to get started.

The first stage is to write the syntax analyser. Syntax is described by listing the rules of the syntax. There is a set of character sequences called terminals. These are the atoms of the language, and are what is seen on the written page. The structure of the language is described by assigning allowable groupings of terminals a more general name describing the grammatical import of the sequences of terminals. These more general names are called non-terminals. A rule is a sequence of terminals and non-terminals which represent an occurence of general type of structure with its own non-terminal name. In a format historically used called BNF (Backus-Naur Form) the non-terminal is written on the left followed by the symbol ::= chosen so that it would not conflict with other combinations of special characters which might need to be used in a grammar, followed by the sequence of terminals and non-terminals which are a permissable rule. Non-terminals are enclosed in chevrons <>, and often terminals are represented in bold case. The illustration above gives examples of two rules in this format. The +, the letter sequences "if" and "then" are terminals in this example of a language. You can have several rules describing one particular non-terminal, and a non-terminal can be an object of its own rule.

These are quite free rules and unfortunately are so free as to cause quite a few problems and ambiguities. The syntax analyser reports on these problems and the author of the language grammar strives to correct them and write an unambiguous grammar. Writing a language grammar is a special discipline and is quite unlike computer programming.

©2000 - 2006 WEBSCOOL This page last updated 15 May 2006. All rights reserved - including copying or distribution of any portion of this document in any form or on any medium without authorisation. For more regarding the copyright.