|
Parsing Simple Data Terence Parr This entry leads you through the development of a simple parser for reading in name/age pairs. You'll see how to use
You will print out the contents of an input file in a certain format. The input is: Griffin 2 Terence 33 Karen 31 Tom 32 Pat 26Parsing
A grammar to recognize a sequence of name/age pairs is pretty simple.
Define a grammar class derived from class P extends Parser; // match one-or-more 'name followed by age' pairs startRule : ( NAME AGE )+ ;
We will define the references to It's probably worth mentioning at this point that tools such as awk and PERL work on a line by line basis, not character by character. Hence, you could write a quick parser to read name/age pairs if they happen to be consistently on a line together or on separate lines. Here is a quick awk invocation that works if the data is on one line: awk '{print "("$1","$2")"}' < input Unfortunately, it will print "(,)" for empty lines at the end also. A real parser, on the other hand, looks only at a stream of vocabulary symbols (tokens); they generally ignore whitespace unless it is part of the vocabulary such as in Python or make (tabs count--ick). Anyway, "right tool for the right job" etc...
Grammar class P extends Parser; // match one-or-more 'name followed by age' pairs startRule : ( n:NAME a:AGE {System.out.println("("+n.getText()+","+a.getText()+")");} )+ ;Lexical Analysis A parser applies a grammatical structure to a sequence of input tokens, but where do the input tokens come from? From a beast called a lexical analyzer or lexer that scans the input characters looking for patterns. [Sounds like a parser doesn't it? It is, but lexers parse characters not tokens. ANTLR treats lexical analysis, parsing, and tree parsing as simply variations on recognition. This generalization provides a pleasing notational consistency across recognizer types.]
To define a vocabulary of tokens, create another grammar class, but
this time derive it from // Lexer framework (rule contents are missing here) class L extends Lexer; NAME: ... ; AGE : ... ; WS : ... ; // white space, which will be ignoredTo match any character in a range, use the range operator; for example, '0'..'9' matches a digit.
Therefore, to match a series of
digits, we can embed the range in the (...)+
positive closure (one-or-more) block or "loop". AGE
can then be completely specified as:
// match a decimal age of any length AGE : ( '0'..'9' )+ ;Similarly, NAME can be specified as:
// match an upper/lower case name of any length NAME: ( 'a'..'z' | 'A'..'Z' )+ ;The alternative operator ' | ' is used to indicate
that either range can be matched. Each time through the loop
the lexer will match either a lowercase or an uppercase letter.
Whitespace such as tab, space, and newline is very often ignored
by parsers. Rather than have the parser try to match and ignore
whitespace tokens everywhere, it is customary for the lexer to
match and throw out whitespace. Ignoring whitespace in the lexer
is a matter of specifying what constitutes whitespace and then
setting the token type for that token to be a special value:
WS : ( ' ' | '\t' | '\r' '\n' { newline(); } | '\n' { newline(); } ) {$setType(Token.SKIP);} //ignore this token ;The Lexer has a predefined notion of newline or carriage return that is used in both the lexer and the parser for error handling and by user actions. A standard method newline can be called
to increment the current line number.
This concludes both the parser and lexer. Easy, right? Put them
both in the same file called $ java antlr.Tool t.g ANTLR Parser Generator Version 2.20 1997 MageLang InstituteTo invoke the parser, a main program such as the following suffices: import java.io.*; class Main { public static void main(String[] args) { try { L lexer = new L(new DataInputStream(System.in)); P parser = new P(lexer); parser.startRule(); } catch(Exception e) { System.err.println("exception: "+e); } } } Compile everything and run the input file through the program as standard input. You should get something similar to the following: $ javac *.java $ java Main < input (Griffin,2) (Terence,33) (Karen,31) (Tom,32) (Pat,26) $Conclusion This Field Guide entry demonstrates all of the basic elements needed to build a typical parser and lexer. Future entries are more terse, but develop more complicated and useful recognizers and translators. Source The following files contain the complete solution to the problem discussed here. |