Writing expression parser using ANTLR

This article is to write simple expression parser using ANTLR. I consider the reader of this article is well build of JAVA. Also I consider that the reader knows, clearly what is expression, expression parser and grammar.

1 What is ANTLR?

ANTLR, is a second generation parser generator1.Both parsers were designed and implemented by Terence Parr. It is a language tool that provides a framework for constructing recognizers, compilers, and translators from grammatical descriptions containing C++ or Java actions (You can use PCCTS 1.xx to generate C-based parsers). The latest version of ANTLR 2.7.2 is now available at http://www.antlr.org/. We use ANTLR to produce Java code for our expression. It is released under GNU licesnse.

1.1 Install ANTLR

Download the compressed tar or zip file from the ANTLR site. Then unzip all to a folder and set the classpath variable to the folder location. If jar files are not available in the folder antlr-2.7.2a2, then run two batch files, located in the directory antlr-2.7.2a2, they will create jar files.

1.2 Run ANTLR

ANTLR grammar files should have with extentions “.g”, to mean that they are grammer file. Example “Parser1.g”.

To run ANTLR, =java antlr.Tool =. Now let us disscuss how to evaluvate a basic mathematical expression, without variables support.

2 Building an expression Parser

2.1 Framing the Syntax

Now we start our hunt..

We should analyze how our expression will look like. Some of our example expressions are, 1 + 2 , 1 * 2, 1 * 2 / 3, 1 + 2 * 3 / 4, 1 + 2 ^ 3, ( 1 + 2 ) * 3, ( 2 * 3 ) ^ 4. in the expression, we use numbers and symbols.

The meaning for the symbols are.

</col>
SymbolsMeaning
()Paranthesis
*Multiplication
+Addition

2.2 Backus Naur Form

<expression> ::= <expr> ‘;’

<expr>      ::=   <term>

<term>      ::=   <factor> [ + <factor> ] [ - <factor> ]

<factor>    ::=   <power> [ * <power> ] [ / <power> ]

<power>     ::=   <neg_num> [ ^ <neg_num> ]

<neg_num>   ::=   <number> [ - <number> ]

<number>    ::=   <digit> [ . <digit> ]

<digit>     ::=   0..9

2.3 Lexical syntax & rules

Symbols we use in ANTLR, we only see some of the basic rule what we use in our program. There are some more rules other than this.

</col>
SymbolMeaning
(…)Sub rule
(…)*Match may occur zero times, one ime or more than one time.
(…)+Positive closure sub rule, occur atleast one time.
(…)?Optional
[…]Rule arguments
\Alternative operator
..Range operator
~Not operator
=Assignment operator
:Lable operator, rule start
;Rule end operator
classGrammar class
extendsSpecifies grammar base class
returnsSpecify the return type of rule
optionsOptions section
tokensTokens section
headerHeader section,defn of packages and import items.
#Root node operator

Range operator : The range binary operator implies a range of atoms may be matched.

Example :

'A' .. 'Z'- All the letters between 'A' to 'Z' 0 .. 9 - All the numbers between 0 to 9 'a' .. 'z' - All the letters between 'a' to 'z'

Alternative operator :

rulename
    :   alternative_1
    |   alternative_2
   ...
    |   alternative_n
    ;

First alternative1 is matched, if it fails it comes to alternative2, and so on until alternativen.

Not operator(Complement operator): The "~" not unary operator must be applied to an atomic element such as a token identifier. For some token atom A, ~A matches any token other than A except end-of-file. Within lexer rules, ~'a' matches any character other than character 'a'.

Rule for single line comment : it will match all the characters, but not the newline and carriage return character, and the rule will end with a newline or a carriage return character. It is a rule for single line comment.

SL_COMMENT : "//" (~('\n'|'\r'))* ('\n'|'\r);=

Rule for String : A string should first start with double quotes(“) then may or may not contain any characters, including the escape character, but not the double quotes(“) or “’” in between and finally end with a double quote(“).

STRING : '"' (ESC | ~('\\'|'"'))* '"';=
Protected ESC : '\\' ('n' | 'r');

AST root operator ( ^ ) : When generating abstract syntax trees (ASTs), token references suffixed with the "^" root operator force AST nodes to be created and added as the root of the current tree. This symbol is only effective when the buildAST option is set.

AST exclude operator ( ! ) : When generating abstract syntax trees, token references suffixed with the "!" exclude operator are not included in the AST constructed for that rule. This symbol is only effective when the buildAST option is set.

Example:

1 + 2 * 3; in this 1, +, 2, * , 3 should be in the tree since they should be evaluvated , can suffixed with ^ operator, and ';' semi colon is not necessary, we don’t do anything with ';', so we can exclude that with a '!' operator, it is just simply to say the end of expression.

Closure rule (…)**: This rule specifies, that the matching character may occur zero times, one time or more than one time.

*See rule for String_

Rule for Variable: a variable should start with a capital letter, may contain another capital letter or number in it.

VARIABLE :   'A'..'Z' ('A'..'Z' | '0'..'9')*
         ;

Positive closure rule (..)+: This rule will say that, the matching item should atleast occur one time and may occur more than one time.

Rule for a Integer : should atleast contain one digit

INT : ('0'..'9')+ ;

Optional closure rule (..)? : This rule will say that, the matching item , may present or may not present, i.e it is optional.

Rule for both Integer and Real number :

NUMBER      :   INT ( '.' (INT)? )?  ;

INT : ('0'..'9')+ ;

Skipping characters : To have the characters matched by a rule ignored, set the token type to Token.SKIP.

Rule for skipping white spaces:

WS : ( ' ' | '\t' | '\n' { newline(); } | '\r' )+

{ $setType(Token.SKIP); }

;

these are whitespaces, not necessary for processing in some situations.

Lexer rules : Rules defined within a lexer grammar must have a name beginning with an uppercase letter. These rules implicitly match characters on the input stream instead of tokens on the token stream.. further, lexer rules can also have local variables and use recursion.

Parser rules: Parser rules apply structure to a stream of tokens whereas lexer rules apply structure to a stream of characters. Parser rules, must not reference character literals. Double-quoted strings in parser rules are considered token references. Note: All parser rules must begin with lowercase letters.

Tree-parser rules : In a tree-parser, an additional special syntax is allowed to specify the match of a two-dimensional structure.

= rule : #(A B C); =

which means "match a node of type A, and then descend into its list of children and match B and C".

Example: Lexical parser

class MyLexer extends Lexer;

WS    : (    ' ' | '\t' | '\n' | '\r' )

            { _ttype = Token.SKIP; }

      ;

// OPERATORS

LPAREN      :     '('   ;     RPAREN            :     ')'         ;

SEMI        :     ';'   ;     DOT               :     '.'         ;

DIV         :     '/'   ;     PLUS              :     '+'         ;

MINUS       :     '-'   ;     STAR              :     '*'         ;

POWER       :     '^'   ;    

//for both integer and real number

NUMBER

      :     (DIGIT)+ ( '.' (DIGIT)+ )?

      ;

//for numbers

protected

DIGIT

      :     '0'..'9'

      ;

2.4 Parser synatax & rules

class MyParser extends Parser;

options {

      buildAST = true;  // uses CommonAST by default

}

statement

      :     expression SEMI!

      ;

expression

      :     additiveExpression

      ;

// addition/subtraction

additiveExpression

      :     multiplicativeExpression((PLUS^ | MINUS^) multiplicativeExpression)*

      ;

// multiplication/division   

multiplicativeExpression

      :     powerExpression ( (STAR^ | DIV^ ) powerExpression )*

      ;

powerExpression  

      :     unaryExpression ( POWER^ unaryExpression)*

      ;

unaryExpression

      :     MINUS^ {#MINUS.setType(UNARY_MINUS);} unaryExpression

      |     primaryExpression

      ;

primaryExpression

      :     NUMBER

      |     LPAREN! additiveExpression RPAREN!

      ;

So now we can easily identify these rules. #MINUS.setType(UNARYMINUS),used to specify that it is not ordinary minus, it is unary minus.

Now TREE PARSER… part

Expression Tree Evaluation part

class MyTree extends TreeParser;

expression returns [double r = 0]

{

      double a,b;

}

:     #(PLUS      a=expression b=expression)    { r = a+b; }

|     #(MINUS     a=expression b=expression)    { r = a-b; }

|     #(STAR      a=expression b=expression)    { r = a*b; }

|     #(DIV      a=expression b=expression)    { r = a/b; }

|     #(POWER     a=expression b=expression)    { r = Math.pow(a,b); }

|     #(UNARY_MINUS     a = expression)         { r = -a; }

|     i:DOUBLE    { r = Double.valueOf(i.getText()).doubleValue(); }

;

The tree parser is used to evaluvate , i.e walking in the tree. It visits each node and identifies them as we specify. It is recursive, so, the expression is evaluvated recursive and the value is obtained.

Example woking of tree parser:

# (PLUS      a=expression b=expression)    { r = a+b; }

Finds PLUS in the node, then descends to it's child. Statements in { } are pure java statements, to say what could the operation is performed, when the above rule is satisfied.

Save the file in some name, there's no rule for file saving as in java. Let us say the file is MyCalc.g. Now we finished constructed the grammar. Now we need to generate Java source file. = java antlr.Tool MyCalc.g=

Calc.java

import java.io.*;

import antlr.*;

import antlr.collections.*;

class Calc {

      public static void main(String[] args) {

            try {

            MyLexer lexer = new MyLexer(new DataInputStream(System.in));

                  MyParser parser = new MyParser(lexer);

                  // Parse the input expression

                  parser.statement();

                  CommonAST t = (CommonAST)parser.getAST();

                  // Print the resulting tree out in LISP notation

                  System.out.println(“Tree :” + t.toStringTree());

                  MyTree tree = new MyTree();

                  System.out.println(“Value:” + tree.expression(t));

            }

            catch(TokenStreamException e) {

                  System.err.println("exception: "+e);

            }

            catch(RecognitionException e) {

                  System.err.println("exception: "+e);

            }

      }

}

Some example expressions:

Expression : (1 + 2) * 3 - ( 4 / 5 * 6 ) ^ 2;

Tree : ( - ( * ( + 1 2 ) 3 ) ( ^ ( * ( / 4 5 ) 6 ) 2 ) )

Value : -14.040000000000006

Expression : 1 + 2 * 1000 / 4 ^ 2 * ( 5 + 20 );

Tree : ( + 1 ( * ( / ( * 2 1000 ) ( ^ 4 2 ) ) ( + 5 20 ) ) )

Value :3126.0

Footnotes:

1 ANTLR is Another tool for Language Recoginition