Programming Languages and Compilers

Lecture 4

In this lecture we will look at scanners, or lexers as they are sometimes called. We will see how scanners can be built by hand and how they can be generated automatically with tools such as JLex. We will also look at the compiler-compiler JavaCC, a tool that can help you generate the front end of a recursive descent compiler.

The slides for this lecture can be found here in ppt and here in pdf.

Literature

Watt and Brown, section 4.5

Sebesta section 3.1 to 3.4 and section 4.1 to 4.4

The JLex manual, by Elliot Berk. The manual can be downloaded from

http://www.cs.princeton.edu/~appel/modern/java/JLex/current/manual.html

The article in Java World: “Build your own language with JavaCC”, by Oliver Enseling, which can be downloaded from http://www.javaworld.com/javaworld/jw-12-2000/jw-1229-cooltools_p.html

As background reading I recommend:

The JavaCC FAQ

You can download a free copy of JavaCC from the website JavaCC Home

The Java Tree Builder tool can be found on

JTB: The Java Tree Builder Homepage

The JLex system can be found at the following URL:

http://www.cs.princeton.edu/~appel/modern/java/JLex/

You can find a C# version of JLex on the following URL: C# Lex Manual

An alternative LL(1) compiler generator is the Compiler Generator Coco/R. There are versions of Coco/R for Java, C#, C++, Oberon, Modula-2 and Pascal.

An alternative LL(k) compiler generator is the ANTLR Parser Generator. There are versions of ANTLR for many languages. You can follow this link to An Introduction To ANTLR using Java.

You can find information about StreamTokenizer at Sun:

StreamTokenizer (Java 2 Platform SE 5.0)

And here is an example usage:

Tokenizing Java Source Code (Java Developers Almanac Example)
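For a quick impression of the class without following the links, here is a minimal sketch of StreamTokenizer in use. The class name StreamTokenizerDemo, the method tokenize and the "KIND(text)" output format are assumptions made for this illustration and are not taken from the linked example. Note that with the default settings StreamTokenizer treats '/' as a comment character and '-' as part of a number, so real source code usually needs further configuration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; only java.io.StreamTokenizer itself is a real API.
class StreamTokenizerDemo {

    // Splits the input into tokens and returns a readable description of each.
    static List<String> tokenize(String input) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(input));
        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                switch (st.ttype) {
                    case StreamTokenizer.TT_WORD:
                        tokens.add("WORD(" + st.sval + ")");
                        break;
                    case StreamTokenizer.TT_NUMBER:
                        // Numbers are always delivered as doubles in st.nval.
                        tokens.add("NUMBER(" + st.nval + ")");
                        break;
                    default:
                        // Ordinary characters are returned as their own code.
                        tokens.add("CHAR(" + (char) st.ttype + ")");
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String token : tokenize("sum = sum + 42;")) {
            System.out.println(token);
        }
    }
}
```

Because every number is returned as a double, a scanner for a real language would typically switch number parsing off and classify digits itself.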

Exercises

Exercises for lecture 4 will be done from 12.30 till 14.15 before Lecture 5 on Tuesday the 6th of March.

Individual Exercises

You may prefer to do the following exercises on your own, e.g. just after you have read the literature, and then discuss the outcome with your group:

Do Watt and Brown exercise 4.17 page 134
Do Watt and Brown exercise 4.18 page 134
Do Watt and Brown exercise 4.19 page 134
Download and install JLex.

Try JLex on the sample specification sample.lex:

http://www.cs.princeton.edu/~appel/modern/java/JLex/current/sample.lex

Download and install JavaCC. Look at the file Calc2i.jj and copy it to an empty directory. Run javacc on the file and look at the generated .java files. Then run javac *.java and java Calc2i.

Try some of the JavaCC examples in the examples directory of the JavaCC distribution.

Look at the files eg1.jjt and eg4.jjt in the examples/JJTreeExamples directory. Copy the files to two new directories. Run jjtree on each of the files and look at the generated .jj files and .java files. Then run javacc on the .jj file and javac *.java. Finally run java eg1 or java eg4, respectively.

Group Exercises

The following exercises are best done as group discussions:

Discuss the outcome of the individual exercises
The lexical symbols of a programming language can be recognized by deterministic finite-state automata (DFAs). These automata can be described by state/transition diagrams in which each node represents a state and each edge a state transition ("circles-and-arrows"). Each edge in a state diagram is labelled with the input symbol that is read when the transition is taken. The start state can be marked by a special incoming arrow, and final (accepting) states are often marked as "doubled" circles (see the slides from the lecture).

Given the alphabet A = { 0, 1 } and the languages defined by the following rules (a) - (e), construct (by hand) a deterministic finite-state automaton recognizing each language. Represent your automata as state diagrams.

(a)	The string of three characters, 101.
(b)	All strings of arbitrary length that end in 101.
(c)	All strings that contain a 101 at least once anywhere.
(d)	All strings that contain no consecutive ones.
(e)	All strings in which the number of zeros is even.
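A state diagram maps directly onto a transition table, which can be a convenient way to check a hand-drawn automaton. The sketch below encodes a DFA for a language that is deliberately not one of (a) - (e): all strings over { 0, 1 } that contain 00. The class name, the state numbering and the table layout are assumptions made for this illustration, not something taken from the slides.

```java
// Table-driven DFA for "strings over {0,1} that contain 00".
// States: 0 = start / last symbol was not 0,
//         1 = exactly one 0 just read,
//         2 = 00 has been seen (accepting; every edge loops back here).
class ContainsDoubleZeroDfa {

    // NEXT[state][symbol] gives the successor state; symbol is 0 or 1.
    private static final int[][] NEXT = {
        {1, 0},   // state 0: on '0' go to 1, on '1' stay in 0
        {2, 0},   // state 1: on '0' go to 2, on '1' go back to 0
        {2, 2},   // state 2: stay in 2 on every symbol
    };
    private static final int START = 0;
    private static final boolean[] ACCEPTING = {false, false, true};

    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') {
                return false;              // symbol outside the alphabet
            }
            state = NEXT[state][c - '0'];  // follow exactly one edge per symbol
        }
        return ACCEPTING[state];           // accept only in a "doubled" circle
    }

    public static void main(String[] args) {
        System.out.println(accepts("100"));   // prints true
        System.out.println(accepts("0101"));  // prints false
    }
}
```

Each row of NEXT corresponds to one circle in the diagram and each entry to one outgoing arrow, so a diagram with a missing or duplicated edge shows up as a row that cannot be filled in.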

Unsigned numbers in Algol-60 are given by the following regular description:

digit = '0'|'1'|'2'|'3'|'4'|'5'|'6'|'7'|'8'|'9'
integer = digit digit*
sign = '+'|'-'
exponent = 'e' (sign | empty) integer
number = integer('.' integer | empty) (exponent | empty) | exponent

Construct a DFA to recognize this language, and represent it as a state diagram. You may find it useful to construct an NDFA-ε first, then convert it to an NDFA, and finally convert that to a DFA.
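To test a hand-built automaton it can help to have an independent reference. The sketch below transcribes the regular description above into a java.util.regex pattern; it is not a solution to the exercise, since java.util.regex is a backtracking matcher and says nothing about the shape of the DFA. The class and method names are assumptions made for this illustration.

```java
import java.util.regex.Pattern;

// Reference recognizer for the unsigned-number description above.
class UnsignedNumberCheck {

    // integer  = [0-9]+
    // exponent = e[+-]?[0-9]+
    // number   = integer ('.' integer)? exponent?  |  exponent
    private static final Pattern NUMBER = Pattern.compile(
        "[0-9]+(\\.[0-9]+)?(e[+-]?[0-9]+)?|e[+-]?[0-9]+");

    static boolean isUnsignedNumber(String s) {
        // matches() succeeds only if the whole string fits the pattern.
        return NUMBER.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"42", "3.14", "3.14e-2", "e10", ".5", "3.", "1e"};
        for (String s : samples) {
            System.out.println(s + " -> " + isUnsignedNumber(s));
        }
    }
}
```

Running the same sample strings through your DFA and through this check is a cheap way to find missing or superfluous transitions.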

(a) Construct a state diagram for a DFA which accepts identifiers that obey the following rules.

The first character must be alphabetic (a letter); following characters may be alphabetic, numeric, or the underscore character; however, an underscore may not be the final character, and two underscores may not be adjacent.

(b) Express the automaton as a regular expression. Use concatenation, alternation (|), closure (*), and, if needed, parentheses for grouping the items. You may find it helpful to introduce short-hand notation to represent any character that is a member of a small specified set, and another notation for a character that is not a member of a given set.

Compare writing a lexer by hand with using JLex. Compare with JavaCC (and possibly Coco/R and/or ANTLR) and list pros and cons of each system.
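As a concrete starting point for the comparison, a hand-written scanner can be as small as the sketch below. It recognizes only identifiers, integer literals and one-character symbols; the class name HandScanner, the token kinds ID, INT and SYM, and the "KIND:text" format are assumptions made for this illustration, not the scanner from Watt and Brown or from the slides.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal hand-written scanner: one method call yields one token.
class HandScanner {
    private final String src;
    private int pos = 0;   // index of the next unread character

    HandScanner(String src) {
        this.src = src;
    }

    // Returns the next token as "KIND:text", or null at end of input.
    String next() {
        // Skip white space between tokens.
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
        if (pos >= src.length()) {
            return null;
        }
        int start = pos;
        char c = src.charAt(pos);
        if (Character.isLetter(c)) {
            // Identifier: a letter followed by letters or digits.
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) {
                pos++;
            }
            return "ID:" + src.substring(start, pos);
        }
        if (Character.isDigit(c)) {
            // Integer literal: one or more digits.
            while (pos < src.length() && Character.isDigit(src.charAt(pos))) {
                pos++;
            }
            return "INT:" + src.substring(start, pos);
        }
        pos++;                 // anything else is a one-character symbol
        return "SYM:" + c;
    }

    // Convenience helper: scans the whole input into a list of tokens.
    static List<String> scanAll(String src) {
        HandScanner scanner = new HandScanner(src);
        List<String> tokens = new ArrayList<>();
        for (String t = scanner.next(); t != null; t = scanner.next()) {
            tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scanAll("count = count + 1;"));
    }
}
```

When comparing, consider how much of this code would have to change to add comments, multi-character operators or keywords, and how the same change would look in a JLex or JavaCC specification.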