Intermediate Representations

March 2021

The main output of the frontend of the compiler is the intermediate representation (IR) of the input program. Different IRs encode different information about the program. IRs are organized around their structural organization, level of abstraction, and naming discipline.

Graphical IRs represent the underlying code as a graph:

A parse tree is a graphical representation for the derivation of the input program.
An abstract syntax tree (AST) is a simplification of the parse tree through the elimination of extraneous nodes.
A directed acyclic graph (DAG) is a further simplification of the AST that removes any duplications.
A control flow graph models control flow between the basic blocks in a program.
A dependence graph encodes flow of data from their definitions to their uses.
A call graph encodes runtime transfer of control between procedures.

Linear IRs represent the underlying code as a sequence of instructions that execute in their order of appearance:

Stack machine code pops operands off the stack and pushes the result of applying some operation on the popped operands back on to the stack.
Three address code use instructions of the form: i ::= j OP k.

Different IRs use different levels of abstractions. An AST might represent an array reference A[i, j] as a node with three children: [A], [i], and [j]. The same operation can be represented in linear IR as a sequence of ILOC operations. The difference between the two is the amount of context that is exposed --- an AST hides the low level details of how memory addresses are computed, whereas ILOC code exposes these details.

Naming discipline affects the type of code that can be generated by the compiler. For instance, using distinct names to hold intermediate results of computations can expose these results to analysis and transformation. On the other hand, reusing names can hide context.

Static single-assignment form (SSA) is a naming discipline used to encode information about both the flow of control and the flow of data values in a program. SSA introduces a phi function that accepts as inputs a set of distinct names, and outputs one of them depending on the runtime control flow at that point in the program. SSA simplifies and improves many optimization techniques, since each name is assigned once, and only once.

The memory model also affects the information that can be represented in an IR. In the register-to-register model, the compiler keeps values in registers aggresively (unless the semantics require it to store them in memory). In the memory-to-memory model, the compiler assumes that all values are kept in memory locations and move them from memory to a register just before they are used. Compilers for RISC machines tend to use the register-to-register model because it reflects the instruction set of RISC architectures more closely.

The compiler needs to store information about the entities manipulated by the program: variables, arrays, structures, procedures etc. This information may be stored in the AST or in a separate symbol table. The symbol table approach is attractive because it is more efficient; in the AST approach the compiler needs to navigate the AST to find the appropriate information. Symbol tables are typically implemented using hash tables.

Compilers use lexically scoped symbol tables to perform name resolution in languages that allow nested scopes. Each scope has its own symbol table that is instantiated when the scope is first entered. Each scope also has a pointer to its parent scope. When the compiler cannot find a name in the symbol table of the current scope, it recursively finds the name in the symbol table of the parent scope.