A Tcl Parser Generator

written May 2006
Google site search

ContentsIntroductionBasicIntermediateAdvancedFuturePolicyInfrastructure
You need the Tcl Plugin to read this page. If this spot  is not green (v2) or cyan (v3)you may want to download the latest Tcl Plugin.

2.0.1   Syntax        

The language which we are about to define is a simple one which defines what simple arithmetic expressions should look like. The application here is two-fold. Firstly to define a grammar for the big-num integers, but also one tat can be easily adapted to other arithmetic objectes such as qualtients or other more general kinds of objects. The difference between the specific integers and the more general is not in the grammar but in the way we ascribe meaning to the phrases of the grammar. For the integer application we can be quite precise, but for the more general case we dont say to much about what the phrases mean exactly but rather we say how the phrases can be combined in ways that are meaningful.

Grammar rules consist of the atoms of the language, called terminals, and rules which say how to combine the terminals into proper expressions or sentences. A rule consists of a name for a phrase, and a list of rules which say how a phrase can be constructed. The names of the phrases are called non-terminals and a rule can consist of a sequence of terminals and non-terminals mixed together in any order.

In this way we build up a structure of how we want the language to look. This is a constructive process. But the parser program has to do something quite different. given a sentance it has to be able to deconstruct it back to its primitive elements so that it can calculate a result. This is the opposite process. This is what we do when we read.

Writing a grammar is a special art because we do not have many models to draw on. We have natural language, and if we study a number of languages which are based on different grammatical structures, this can give us some idea of how to build sentences which have meaning. Much of natural language does not have precise unequivocal meaning, it is often ambiguous, or abstruse or open to interpreation, poetry in particular. But for the present we want the grammars which we give to computers to determine meaning precisely and uniquely.

There are a number of procedures for determining the meaning of a sentence. They all are based in the same principle. We look at the first word in the sentence and look amongst the rules for rules which permit this word to be at the start of a sentence, and at the start of a phrase. We then look at all the phrases that do match and make a list of all the words that can follow immediately after. This may involve expanding out some of the rules as we go until the list exists entirely of terminal. We then match the second word against this list. Typically there may be several matches, but we throw away all the rules in our list which do not match. We then apply the same process to this smaller set of rules and expand it out for all the third words that can appear in the phrase. At some point in this process we can reduce our initial list of rules for the first word down to one or maybe exclude all the rules. If none of the rules fit we say the sentences has a syntax error. It does not fit any of the syntax rules. If one rule remains in the list we fix on that as the phrase that begins the process and look further along the sence, either for more subphrases in this sentence or for a new sentance. As you can see this a complex process of checking lots of long lists of things, something that a computer is very good at.

There are a number of ways of doing this list checking and of deciding when the list has been reduced to obeen rule. In this project we used an algorithm called mixed precedence strategy (MPS). This strategy only looks a couple of phrases ahead, and if the rule to be applied cannot be recognised uniquely it fails.

This may sound like a limitation, but what we do is make sure we write a grammar which does indeed allow the structure of a sentence to be determined by looking only a few phrases ahead. This is the job of the syntax analyser process. This is not so unlike natural language. We determine the meaning of a sentence as we read it. We are not happy with sentences that fail to mean anything until we have reached the last word. We tend to use the form <subject> <predicate> and <subject> <verb> <object> and we use special words to indicate subphrases such as "where", "when", "that", "while", "which", the "wh..." words, and we use the small words "in", "to", "on" and so on for subphrases.

The truth is, we are not particularly good at dealing with sentence structures that are too convoluted.

For now we sketch out the rules for the integer arithmetioc expression. We want the usual operation, +, - , * for multiplication, / for divide without remainder, and % for modulus (remainder). We also want parenthesis ( ) to group subexpressions, and of course the numbers themselves. Tcl allows us to keep this simple. We do not need to include variables and functions in our syntax because we can use Tcl itself to handle this aspect.

Syntax ais normally written in a special form called Backus-Naur Form (BNF). Each rule consusts of a non-terminal followed by the special symbol ::= and then a sequence of terminlas and non-terminals. Typically the non-terminals are signalled by having chevrons < > around them. So typically a rules looks like this:

<add_expr> ::= <expr> + <expr>
It is common for rules to self-reference their lefthand non-terminal either directly or in some form of close circuit. For instance the rule above xcould have been written:
<add_expr> ::= <add_expr> + <add_expr>
to express the idea that we can have a sequence of expressions seperated by + signs which can added together without confusing the possible meaning. However it is not easy to describe this property to a computer. So we usually use the list convention to describe a string of similar phrases, like this:
<add_list> ::= <number>  + <add_list>
This is the usual convention and is called right recursive. This is the way to write grammars for an LALR parsing algorithm. Because of the way that the MPS parser looks ahead, it is better to use left recursion like this:
<add_list> ::= <add_list> + <number>
Here is part of the syntax for the integer expression:
<add_list> ::=  <add_list> <item> + 
<add_list> ::=  <item>  + 
<sub_list> ::=  <sub_list> <item>  - 
<sub_list> ::=  <item>  - 
<mult_list> ::=  <mult_list> <item> * 
<mult_list> ::=  <item> * 
<div_list> ::=  <div_list> <item> / 
<div_list> ::=   <item> / 
<item> ::= INTEGER
<item> ::=  ( <expr> ) 
If we write the syntax like this we will need to write a fancy decoder to convert this into input suitable for our program. Tcl has the nice property that it is easy to turn data like this into program, just put an appropriate procedure name at the head of each line and make the lines look like strings of words. So to simplify matters from the outset we change the format a bit to this:

The rules have been wrapped into a codeblock which is fed to a procedure called analyse. [analyse] reads the codeblock as an ordinary list of tcl commands. The code block consists of a sequence of rule commands. [analyse] does something special: when it has finished reading the rule commands it then analyses the data structure that has been built up by the rule coomands, reports on grammar and returns a table of values to be used by the scanner procedure at runtime.
# Syntax for integer procedure
analyze {
rule expr {  add_list sub_list  item }
rule expr { add_list item }
rule expr { sub_list item }
rule expr { mult_list div_list item }
rule expr { mult_list item}
rule expr { div_list item}
rule expr { item % item }
rule expr { item }
rule add_list  { add_list item + } 
rule add_list  { item  + }
rule sub_list { sub_list item  - }
rule sub_list {  item  - }
rule mult_list { mult_list item * }
rule mult_list  { item * }
rule div_list {  div_list item / }
rule div_list { item / }
rule item INTEGER
rule item { ( expr ) }
}
Each rule has been turned into a Tcl command called [rule] which has two arguments: the lefthand non-terminal and the righthand expansion of the grammar written as a simple Tcl list. This list does not readily reveal which words represent non-terminals. The [analyse] procedure works this out for itself. Basically any word that does not appear on the left side must represent a terminal. Left hand non-terminals that do not appear on the right-hand side are assumed to describe correct sentences in the grammar. But by convention the lefthand of the first rule is assumed to define correct sentences (even if it appears on the right hand side). This is called the goal symbol.

On the next page we look at what the syntax analyser does.

Prev Next

©2000 - 2006 WEBSCOOL This page last updated 29 May 2006. All rights reserved - including copying or distribution of any portion of this document in any form or on any medium without authorisation. For more regarding the copyright.