Back

Lexical Analysis

March 2021

Scanner

The scanner (or lexical analyzer) reads a stream of characters and produces a stream of words. It determines whether a group of characters in the input stream is a valid word, and if so, assigns it a syntactic category.

Recognizers, DFAs, Regexes

The model that the scanner uses to perform lexical analysis is called a recognizer. A recognizer is a deterministic finite automaton (DFA). A valid word in the input language takes the DFA to an accept state. Hence, a stream of characters can be transformed into a stream of words by feeding the character stream into a DFA and keeping track of all instances when the DFA enters an accept state.

However, specifying a language in terms of DFAs can be inconvenient. Instead, compiler writers use regular expressions. Regexes, much like DFAs, define a set of valid words. In fact, any regexes can be translated into an equivalent DFA via a series of algorithms --- one that converts the regex into a nondeterministic finite automaton (NFA), followed by one that converts an NFA into a DFA by subset construction, followed by another one that minimizes a DFA by partitioning its states into equivalence classes.

Implementing the Scanner

There are several strategies for implementing scanners. One approach is table-driven. A scanner generator accepts a set of regular expressions, converts them into a DFA, and uses this DFA to generate and fill in a set of tables. These tables will drive a skeleton scanner, which consists of four sections: initialization, a scanning loop, a roll back loop, and a final section that interprets and reports the result.

An alternative to the table-driven approach is to use a hand-coded scanner. Hand-coded scanners replace the explicit DFA transitions specified by the tables with an implicit one specified in terms of the control flow of the scanner program. Hand coding eliminates memory references and allows other specializations.

For instance, consider a procedure that advances the input character stream and returns the next character in the stream whenever it is called. In a hand-coded scanner, this procedure can be replaced with one that returns a buffer of characters rather than just one charcater on each call. This reduces the amount of overhead due to procedure calls.