Here is an introduction to all you need to know for LL(1) parsing correct input for CS164 at UC Berkeley.

Note: The following discussion is informal and does not cover all of the cases found in left factorization and LL(1) parsing grammars. To be sure you've covered all your bases, reread the text (Dragon book) and the lecture notes.

The following grammar comes from Written Assignment 2.

E -> E + T | T T -> T * F | F F -> (E) | int

This is a grammar for arithmetic. There are several orders of precedence here: ()'s beat *, and * beats +. We'd like to parse sentences using this grammar using a top-down, recursive descent algorithm.

This algorithm traverses the grammar looking for matches between terminals (*, +, (, ), and int) and the input sentence. We do this search depth-first, which presents a problem. If we start at the starting production E, and derive the production E + T, we have E still on the left. In recursive-descent parsing, we can only expand the left-most non-terminal at each step! We're going to infinitely loop if we try to parse sentences using this grammar.

How do we fix it? We use a technique known as left-recursion
elimination to get rid of the non-terminals on the left-hand side of
each production that cause us to infinitely loop. (Note: not all
grammars have left recursion. You can identify the ones that have
*immediate* left recursion by looking at all of the productions
-- if the non-terminal on the left side of the arrow is the same as
the non-terminal in the left-most position of any phrase on the right
side of the arrow, then this grammar is left-recursive. There are
other forms of left recursion that can show up if you were to
"recurse" down multiple rules in the grammar. If you eventually will
cause any infinite loop, the grammar is left-recursive.)

Let's take a look at the first one:

E -> E + T | T

What we do is this. For each production where the non-terminal on the left (E) of the arrow is the same as the left-side of a production on the right-hand side of the arrow (E + T), we take the part of the production without the E (+T) and move it down into its own new production (we'll call it E').

E' -> + T

We're not done yet. Now, after each of the new productions, add E' to the end.

E' -> + T E'

Nope, still not done yet. Now add an extra production to epsilon.

E' -> + T E' | epsilon

Good. Now we must fix up the original E productions. Here,
we take all of the right-hand sides that **didn't** start with
E, and add E' to the end of them.

E -> T E'

If we do this for the T productions as well, we get the following grammar:

E -> T E' E' -> + T E' | epsilon T -> F T' T' -> * F T' | epsilon F -> (E) | int

Note, the F production didn't get changed at all. That's because F didn't appear on the leftmost position of any of the productions on the right-hand side of the arrow.

Once we have performed left-recursion elimination on our grammar, we need to construct our parser. We're going to use an algorithm called LL(1) parsing. This is an algorithm that utilizes a lookup table to parse an expression. On top, we list all of the terminals, and on the left, we list the non-terminals (we include $ to signify end-of-file).

+ | * | ( | ) | int | $ | |

E | ||||||

E' | ||||||

T | ||||||

T' | ||||||

F |

Our LL(1) parser also contains a stack of non-terminals. Initially, we'll only put the starting state (E) on the stack. As we parse, we'll be putting non-terminals on and popping them off this stack. When they're all gone, and our stack is empty (and there is no more input), we're done.

At each step, there is a grammar symbol at the top of the stack. When we see an input token from the lexer, we look in the table to see what to do. Each cell in the table is going to tell our LL(1) parser what to do when it sees the terminal on the top when the non-terminal on the left is at the top of the stack.

Right now, our table is empty. Let's fill it up.

We do this by computing two functions called **FIRST **and
**FOLLOW**.

**FIRST** is a function on each non-terminal (E, E', T, T', and
F) that tells us which terminals (tokens from the lexer) can appear as
the first part of one of these non-terminals. Epsilon (neither a
terminal nor a non-terminal) may also be included in this **FIRST**
function. (**FIRST** is also defined for terminals, but its value
is just equal to the terminal itself, so we won't talk about it more.) What
this means is that the parser is going to invoke some of these
productions in the grammar. We need to know which one to pick when we
see a particular token in the input stream.

So, let's start computing. What is the **FIRST**(E)? What are
the terminals that can appear at the beginning of the stream when
we're looking for an E? Well, E -> T E', so whatever occurs at the
beginning of E will be the same as what happens at the beginning of
T.

FIRST(E) =>FIRST(T)

How about **FIRST**(E')? This one is easy, we have the terminal
+, and epsilon.

FIRST(E') = { +, epsilon }

And we'll continue with the others:

FIRST(T) =>FIRST(F)FIRST(T') = { *, epsilon }FIRST(F) = { (, int }

See? **FIRST**(F) is just the set of terminals that are at the
beginnings of its productions.

So, to sum up:

FIRST(E) = { (, int }FIRST(E') = { +, epsilon }FIRST(T) = { (, int }FIRST(T') = { *, epsilon }FIRST(F) = { (, int }

Now, let's do **FOLLOW**. Just as **FIRST** shows us the
terminals that can be at the beginning of a derived non-terminal,
**FOLLOW** shows us the terminals that can come *after* a
derived non-terminal. Note, this does *not* mean the last
terminal derived from a non-terminal. It's the set of terminals that
can come after it. We define **FOLLOW** for all the non-terminals
in the grammar.

How do we figure out **FOLLOW**? Instead of looking at the first
terminal for each phrase on the right side of the arrow, we find every
place our non-terminal is located on the right side of *any* of
the arrows. Then we look for some terminals. As we go through our
example, you'll see almost all of the different ways we figure out the
**FOLLOW** of a non-terminal.

First, however, let's pretend that our grammar starts with a unique starting production (it's not really part of the grammar):

S -> E

We start our journey at S, but rewrite it a bit to reflect the EOF that can be at the end. In parser-land, EOF is represented by $. So our production is really:

S -> E $

What is **FOLLOW**(E)? (Note: We don't really care about
**FOLLOW**(S) because it's just imaginary.) Look on all of the
right-hand sides (after the arrow) of all of the productions in the
grammar. What terminals appear on the right of the E? Well, I see a $
and a ).

FOLLOW(E) = { $, ) }

How about **FOLLOW**(E')? There's nothing after either of the E's
in the grammar. What do we do now? Well, if we had derived E'
from E in E -> TE', then whatever follows the E is the same
as whatever follows the E'. For the other production, we get that
whatever follows E' is the same as whatever follows E'. That's
a tautology.

FOLLOW(E') =>FOLLOW(E)

Now, let's do the rest:

FOLLOW(T) = ?

T is always followed by E'. So, whatever terminals begin E' must be
the terminals that follow T. So this means that **FOLLOW**(T) =>
**FIRST**(E'). It seems a bit weird, but since when we're done,
we'll have no more non-terminals in our stream, so only the terminals
are important to look at. Since T will disappear into some terminals,
those are the same that will tell us that we're starting to do E'.

FOLLOW(T) =>FIRST(E')FOLLOW(T') =>FOLLOW(T)FOLLOW(F) =>FIRST(T')

If we solve this system of equations, we get:

FOLLOW(S) = { $ }FOLLOW(E) = { $, ) }FOLLOW(E') = { $, ) }FOLLOW(T) = { +, epsilon }

Hold on! What's that epsilon doing there? We can't have that. Where
did it come from? Ah, it came from including **FIRST**(E'). Well,
if E' -> epsilon, then whatever terminals follow E'
(**FOLLOW**(E')) will be the terminals we're looking for. (These
are not the droids you're looking for.) These are $ and ). Let's add
those to our list:

FOLLOW(T) = { +, $, ) }FOLLOW(T') = { +, $, ) }FOLLOW(F) = { *, epsilon }

There's that epsilon again. If T' -> epsilon, then we want
to add **FOLLOW**(T') in there.

FOLLOW(F) = { *, +, $, ) }

Good, now we're done.

Now it's time to put these things back into our table. Each of the cells should be filled in with the production that the non-terminal on the left column takes when we see the terminal on the top in our input stream. Note: we don't have an S row in this table because it's not a very interesting transition. It just goes to E.

+ | * | ( | ) | int | $ | |

E | ||||||

E' | ||||||

T | ||||||

T' | ||||||

F |

Our **FIRST** calculations have told us exactly what we need! They
tell us what terminals we're allowed to see on the input stream
when we're looking for one of those non-terminals.

We'll reprint the **FIRST** equations to remind ourselves what
the values were:

FIRST(E) = { (, int }FIRST(E') = { +, epsilon }FIRST(T) = { (, int }FIRST(T') = { *, epsilon }FIRST(F) = { (, int }

So, let's fill in the first row:

+ | * | ( | ) | int | $ | |

E | can't happen | can't happen | E -> T E' | can't happen | E -> T E' | can't happen |

E' | ||||||

T | ||||||

T' | ||||||

F |

The production we put in is the only one available. Let's do another one.

+ | * | ( | ) | int | $ | |

E | can't happen | can't happen | E -> T E' | can't happen | E -> T E' | can't happen |

E' | + T E' | |||||

T | ||||||

T' | ||||||

F |

Wait a minute. We've got that epsilon in there. There's no
epsilon in the input stream! Remember what we did in the **FOLLOW**
function when we found the epsilon? We took the **FOLLOW** of that
production, **FOLLOW**(E'), which turned out to be ) and $. So, let's
take that epsilon production whenever we see ) or $.

+ | * | ( | ) | int | $ | |

E | can't happen | can't happen | E -> T E' | can't happen | E -> T E' | can't happen |

E' | + T E' | can't happen | can't happen | E' -> epsilon | can't happen | E' -> epsilon |

T | ||||||

T' | ||||||

F |

Let's fill in the next few lines:

+ | * | ( | ) | int | $ | |

E | can't happen | can't happen | E -> T E' | can't happen | E -> T E' | can't happen |

E' | + T E' | can't happen | can't happen | E' -> epsilon | can't happen | E' -> epsilon |

T | can't happen | can't happen | T -> F T' | can't happen | T -> F T' | can't happen |

T' | T -> epsilon | T' -> * F T' | can't happen | T' -> epsilon | can't happen | T' -> epsilon |

F | can't happen | can't happen | F -> ( E ) | can't happen | F -> int | can't happen |

And, we're done.

Now what do we do with this table? This table forms one part in a three part data structure. The other two parts are a stack of grammar symbols (E, E', T, T', F, +, *, (, ), int, and $), and an input stream (the expression we want to parse, already tokenized into lexemes by the scanner).

We start our stack with the starting non-terminal: E. Our input
will start with this example: `3 + 5 * 7`

.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

Now, look at the table. If we have an E on the stack, and our input is a 3 (an integer), we pick the transition E -> T E'. This means, we pop E off the stack and replace it with T E'.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

Note, we haven't done anything to the input yet. We're still only dealing with non-terminals. Now we have a T on the top of the stack, and an integer on the input. Yes, we go to T -> F T'.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

Let's continue another step. If we have an F on the input stack and see an integer, we do F -> int.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

Now we have an int at the top of the stack, and an int beginning the input stream. We pop both of these off.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

+ 5 * 7 $ | 5. | T' E' |

Now we have a T' on the top of the stack, and a + on the input. T' -> epsilon. This means we pop T' off our stack and do nothing to the input.

Input |
Step # |
Stack
(top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

+ 5 * 7 $ | 5. | T' E' |

+ 5 * 7 $ | 6. | E' |

We now have an E' on the top of the stack, and a + on the input. Now we do the production E' -> + T E'.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

+ 5 * 7 $ | 5. | T' E' |

+ 5 * 7 $ | 6. | E' |

+ 5 * 7 $ | 7. | + T E' |

And again, we have a terminal at the top of the stack, and a matching terminal on the input stream. So we pop them both, and continue.

Input |
Step # |
Stack (top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

+ 5 * 7 $ | 5. | T' E' |

+ 5 * 7 $ | 6. | E' |

+ 5 * 7 $ | 7. | + T E' |

5 * 7 $ | 8. | T E' |

At this point, we're in a pretty similar position to step 2. Let's do a few steps and see if you're still with us.

Input |
Step # |
Stack
(top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

+ 5 * 7 $ | 5. | T' E' |

+ 5 * 7 $ | 6. | E' |

+ 5 * 7 $ | 7. | + T E' |

5 * 7 $ | 8. | T E' |

5 * 7 $ | 9. | F T' E' |

5 * 7 $ | 10. | int T' E' |

* 7 $ | 11. | T' E' |

* 7 $ | 12. | * F T' E' |

Still with us? We parsed enough to pop the 5 off the input, and are now about to get rid of the * in another step.

Input |
Step # |
Stack
(top is on left) |

3 + 5 * 7 $ | 1. | E |

3 + 5 * 7 $ | 2. | T E' |

3 + 5 * 7 $ | 3. | F T' E' |

3 + 5 * 7 $ | 4 | int T' E' |

+ 5 * 7 $ | 5. | T' E' |

+ 5 * 7 $ | 6. | E' |

+ 5 * 7 $ | 7. | + T E' |

5 * 7 $ | 8. | T E' |

5 * 7 $ | 9. | F T' E' |

5 * 7 $ | 10. | int T' E' |

* 7 $ | 11. | T' E' |

* 7 $ | 12. | * F T' E' |

7 $ | 13. | F T' E' |

7 $ | 14. | int T' E' |

$ | 15. | T' E' |

$ | 16. | E' |

$ | 17. | DONE! |

Whew! Finished. We know we're done because the stack is now empty, and the input is at EOF ($). If there was more input and we ran out of stack symbols, we'd be in an error situation. And we'll leave that story for another time.

That's most of all you need to know for parsing correct input using LL(1). You should read the Dragon book for more detailed (and theoretical) information to understand the complete story.

Questions? Comments?

Andrew Begel abegel@cs.berkeley.edu