Syntactic Analysis

March 2021

Parser

The parser (or syntactic analyzer) reads a stream of words and determines whether the stream of words constitutes a syntactically valid program. It does so by building a parse tree. This parse tree is also used by the compiler to analyze and optimize the input program (see Intermediate Representations).

The syntactic rules of a language are specified using a formalism called context-free grammars. A program is syntactically correct if there exists a derivation for the sequence of words that make up the program.

Starting with a prototype string that consists of only the starting symbol, additional strings are derived by replacing one nonterminal in the prototype with the RHS of some production that has as its LHS the matching nonterminal.

The parser finds a derivation by building a parse tree. This parse tree has as its root node the starting nonterminal symbol, and its leaves the sequence of words that make up the program. The goal of the parser is to build the part of the tree between the root and the leaves.

Top-Down and Bottom-Up Parsing

A parser can build the parse tree from top to bottom, or from the bottom to the top. A top-down parser expands nodes, starting at the root node; a bottom-up parser reduces nodes, starting from the leaves.

At each step of the parse, the parser has to select the correct production to use to expand or reduce the nodes with. In top-down parsing, we apply the backtrack-free constraint to the CFGs to enforce this behavior. With a backtrack-free grammar, the parser will always pick the correct production for expansion (if it exists), using a one symbol lookahead.

There are two strategies for implementing top-down parsers. Recursive descent top-down parsers use procedures (one for each nonterminal) that call each other recursively, advancing the input stream whenever they recognize the appropriate symbols. A table-driven LL(1) parser uses a generated parsing table to drive a skeleton parser.

An alternative to top-down parsing is bottom-up parsing. A bottom-up parser has to find a valid substring in the input stream and find a production with the matching RHS, before it can reduce the substring into the LHS of that production.

An LR(1) parser is an implementation of a table-driven bottom-up parser. The grammars used by LR(1) parsers do not need to be constrained. Instead, the parser stack, which encodes the upper frontier of the partially build tree, is interleaved with additional state information that stores some local left-hand context about the parsing history thus far. The skeleton parser uses the stack and a set of parsing tables to determine what to do next --- either to reduce a substring in the frontier or to push more symbols onto the stack to gain more context.