Intermediate Code Generation
Intermediate code generation translates the annotated AST into an intermediate representation (IR), a lower-level but still machine-independent form. The IR is the input to optimization and code generation phases.
Why an Intermediate Representation?
Separation of concerns: the front end (language-specific) and back end (machine-specific) communicate through a common IR. An $m$-language, $n$-target compiler needs only $m + n$ components instead of $m \times n$.
Optimization: most optimizations are cleaner and more powerful on IR than on source or machine code.
Portability: the same IR can target multiple architectures.
Three-Address Code (TAC)
The most common IR for imperative languages. Each instruction has at most three operands: two sources and one destination.
Instruction forms:
x = y op z binary operation
x = op y unary operation
x = y copy
goto L unconditional jump
if x goto L conditional jump
if x relop y goto L comparison branch
x = y[i] indexed load
x[i] = y indexed store
x = &y address-of
x = *y pointer load
*x = y pointer store
param x push argument
call f, n call function f with n arguments
x = call f, n call with return value
return x function return
Temporaries: the compiler generates fresh temporary variables (e.g., t1, t2) to hold intermediate results.
Example: a = b * c + d * e
t1 = b * c
t2 = d * e
a = t1 + t2
Static Single Assignment (SSA) Form
Every variable is defined exactly once. If a variable has multiple definitions at different paths, use phi-functions at join points.
# Original
x = 1
if cond:
x = 2
# use x
# SSA form
x1 = 1
if cond:
x2 = 2
x3 = phi(x1, x2) # x3 = x1 if came from left, x2 from right
# use x3
Benefits of SSA:
- Simplifies data-flow analysis (each use has exactly one definition).
- Enables many optimizations: dead code elimination, constant propagation, value numbering.
SSA is the standard IR for LLVM, GCC’s GIMPLE, and most modern optimizing compilers.
AST to TAC Translation
Arithmetic expressions: generate a new temporary for each subexpression.
def gen_expr(node):
if node is BinOp:
t1 = gen_expr(node.left)
t2 = gen_expr(node.right)
t = new_temp()
emit(f"{t} = {t1} {node.op} {t2}")
return t
elif node is Var:
return node.name
elif node is Lit:
return str(node.value)
Boolean expressions with short-circuit evaluation:
# x && y: jump to false if x is false; don't evaluate y
if_false x goto false_label
if_false y goto false_label
result = 1
goto end
false_label: result = 0
end:
Control flow:
# if (cond) then S1 else S2
t = gen_expr(cond)
if_false t goto else_label
gen_stmt(S1)
goto end_label
else_label: gen_stmt(S2)
end_label:
While loop:
# while (cond) S
begin_label:
t = gen_expr(cond)
if_false t goto end_label
gen_stmt(S)
goto begin_label
end_label:
Other IR Formats
Stack-based IR (JVM bytecode): operands are pushed onto a stack; instructions pop their operands and push results. Compact; easy to generate; harder to optimize.
LLVM IR: typed SSA in text form. Explicit types for all values. Phi-nodes. Multiple function parameters. Rich type system (pointers, arrays, structs, vectors).
RTL (Register Transfer Language): GCC’s low-level IR. Represents machine instructions abstractly with virtual registers. Used in the back end.
Basic Blocks and Control Flow Graphs
A basic block is a maximal sequence of instructions with no internal branches: one entry, one exit.
Leaders: the first instruction, any branch target, and any instruction after a branch starts a new basic block.
Control flow graph (CFG): nodes are basic blocks; edges represent possible control flow. An edge from block $A$ to block $B$ if $B$ could execute immediately after $A$.
B1: t1 = x * 2; if t1 > 10 goto B3
B2: y = t1 + 1; goto B4
B3: y = t1 - 1; goto B4
B4: return y
The CFG is the substrate for data-flow analysis and most optimizations.