In this lecture we will look at scanners, or lexers as
they are sometimes called. We will look at how scanners can be built by hand
and how they can be generated automatically using tools such as JLex. We will
also look at the JavaCC compiler-compiler. This tool can help you generate the
front end of recursive descent compilers.
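To give a first impression of what a hand-built scanner looks like, here is a minimal sketch in Java. It is not taken from the literature: the class name HandScanner and the token labels IDENT, INT and OP are made up for this illustration, and the token set is deliberately tiny (identifiers, unsigned integers and single-character operators).

```java
import java.util.ArrayList;
import java.util.List;

class HandScanner {
    // Scans the source text from left to right and returns one label per token.
    // The labels IDENT, INT and OP are hypothetical names chosen for this sketch.
    static List<String> scan(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // skip white space between tokens
            } else if (Character.isLetter(c)) {
                int start = i; // identifier: a letter followed by letters or digits
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) i++;
                tokens.add("IDENT:" + src.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i; // integer literal: one or more digits
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("INT:" + src.substring(start, i));
            } else {
                tokens.add("OP:" + c); // anything else is a one-character operator
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("x1 = 42+y"));
    }
}
```

Each branch of the loop corresponds to one kind of token; a generator such as JLex produces comparable code from a specification of regular expressions instead.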
The slides for this lecture can be found here in ppt and here in pdf.
Watt and Brown, section 4.5
Sebesta section 3.1 to 3.4 and section 4.1 to 4.4
The JLex manual, by Elliot Berk. The manual can be
downloaded from
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html
The article in Java World: “Build your own language
with JavaCC”, by Oliver Enseling, which can be downloaded from http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools_p.html
As background reading I will recommend you read:
You can download a free copy of JavaCC from the
website JavaCC Home
The Java Tree Builder tool can be found on
JTB: The Java Tree Builder Homepage
The JLex system can be found at the following URL:
http://www.cs.princeton.edu/~appel/modern/java/JLex/
You can find a C# version of JLex at the following
URL: C# Lex Manual
An alternative LL(1) compiler generator is the Compiler Generator
Coco/R. There are versions of Coco/R for Java, C#, C++, Oberon, Modula-2
and Pascal.
An alternative LL(k) compiler generator is the ANTLR Parser Generator. There are versions of
ANTLR for many languages. You can follow this link to An
Introduction To ANTLR using Java.
You can find information about StreamTokenizer at
Sun:
StreamTokenizer (Java 2 Platform SE 5.0)
And here is an example of its usage:
Tokenizing Java Source Code (Java Developers Almanac Example)
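As a rough illustration of the StreamTokenizer class, the following sketch tokenizes a short string using the default settings of java.io.StreamTokenizer. The class name StreamTokenizerDemo and the labels WORD, NUMBER and CHAR are made up for this illustration; only StreamTokenizer, its fields ttype, sval and nval, and its TT_ constants come from the Java API.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class StreamTokenizerDemo {
    // Returns a printable description of every token in the input.
    static List<String> tokenize(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add("WORD(" + st.sval + ")"); // word text is in sval
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        tokens.add("NUMBER(" + st.nval + ")"); // numbers are parsed as double
                        break;
                    default:
                        // for ordinary characters, ttype holds the character itself
                        tokens.add("CHAR(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected when reading from a String
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("x = 42 + y")) {
            System.out.println(t);
        }
    }
}
```

Note that with the default settings every number is delivered as a double in nval, and that '/' starts a comment, so a scanner for a real language would normally need to reconfigure the tokenizer first.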
Exercises for lecture 4 will be done from 12.30 till
14.15 before Lecture 5 on Tuesday the 6th of March.
You may prefer to do the following exercises on your
own, e.g. just after you have read the literature, and then discuss the outcome
with your group:
Try JLex on the sample grammar sample.lex:
http://www.cs.princeton.edu/~appel/modern/java/JLex/current/sample.lex
The following exercises are best done as group
discussions:
Given the alphabet A = { 0, 1 } and the
languages defined by the following rules (a) - (e), construct (by hand) a
deterministic finite state automaton recognizing each language. Represent your
automata as state diagrams.
(a) The string of three characters, 101.
(b) All strings of arbitrary length that end in 101.
(c) All strings that contain a 101 at least once anywhere.
(d) All strings that contain no consecutive ones.
(e) All strings in which the number of zeros is even.
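Once you have drawn a state diagram, you may want to check it by turning it into code. The sketch below shows one possible way to do this with a transition table. To avoid giving away the exercises, it uses a language that is not among (a) - (e): all strings over { 0, 1 } that end in 0. The class name EndsInZeroDfa and the state numbering are assumptions made for this illustration.

```java
class EndsInZeroDfa {
    // State 0: start state, or the last symbol read was 1 (not accepting).
    // State 1: the last symbol read was 0 (accepting).
    // TRANSITIONS[state][symbol] is the next state, where symbol is 0 or 1.
    private static final int[][] TRANSITIONS = {
        {1, 0}, // from state 0
        {1, 0}, // from state 1
    };
    private static final boolean[] ACCEPTING = {false, true};

    // Runs the automaton over the input and reports whether it ends in an accepting state.
    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') {
                return false; // symbol outside the alphabet
            }
            state = TRANSITIONS[state][c - '0'];
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        System.out.println(accepts("1010"));
        System.out.println(accepts("101"));
    }
}
```

To test one of your own automata, replace the two tables with the rows read off your state diagram; the accepts method can stay unchanged.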
digit = '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
integer = digit digit*
sign = '+'|'-'
exponent = 'e' (sign | empty) integer
number = integer('.' integer | empty) (exponent | empty) | exponent
Construct a DFA to recognize this language, and
represent it as a state diagram. You may find it useful to construct an
NDFA-ε, then convert it to an NDFA and finally convert it to a DFA.
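If you want a reference to test your DFA against, the grammar rules can be transcribed almost literally into a java.util.regex pattern. The sketch below does this rule by rule, following the rules exactly as written above (so a bare exponent such as e10 counts as a number). The class name NumberRegex is made up for this illustration, and "(x | empty)" is written as "(x)?".

```java
import java.util.regex.Pattern;

class NumberRegex {
    // Each constant mirrors one rule of the grammar.
    static final String DIGIT = "[0-9]";
    static final String INTEGER = DIGIT + DIGIT + "*";
    static final String SIGN = "[+-]";
    static final String EXPONENT = "e(" + SIGN + ")?" + INTEGER;
    static final String NUMBER =
        INTEGER + "(\\." + INTEGER + ")?(" + EXPONENT + ")?|" + EXPONENT;

    private static final Pattern NUMBER_PATTERN = Pattern.compile(NUMBER);

    // True if the whole string is a number according to the grammar.
    static boolean isNumber(String s) {
        return NUMBER_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"42", "3.14", "1e-5", "e10", "3.", ".5", "1e"};
        for (String s : samples) {
            System.out.println(s + " -> " + isNumber(s));
        }
    }
}
```

Your DFA should accept exactly the strings for which isNumber returns true.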
The first character must be alphabetic (a letter);
following characters may be alphabetic, numeric, or the underscore character;
however, an underscore may not be the final character, and two underscores may
not be adjacent.
(b) Express the automaton as a regular expression. Use
concatenation, alternation (|), closure (*), and, if needed, parentheses for
grouping the items. You may find it helpful to introduce short-hand notation to
represent any character that is a member of a small specified set, and another
notation for a character that is not a member of a given set.
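One common shorthand of this kind is the character class notation used by java.util.regex and by many scanner generators: [abc] stands for any one character from the set, and [^abc] for any one character not in the set. The small sketch below only demonstrates the notation on the arbitrary set { a, b, c }; the class name CharClassDemo is made up for this illustration.

```java
class CharClassDemo {
    // True if c is one of the characters a, b or c.
    static boolean inSet(char c) {
        return String.valueOf(c).matches("[abc]");
    }

    // True if c is any character other than a, b or c.
    static boolean notInSet(char c) {
        return String.valueOf(c).matches("[^abc]");
    }

    public static void main(String[] args) {
        System.out.println(inSet('b'));
        System.out.println(notInSet('b'));
    }
}
```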