Introduction to Compilers
A compiler is a program that translates source code written in one language into an equivalent program in another language, typically from a high-level language to machine code or an intermediate representation. Understanding compiler construction reveals how programming languages work and how programs are transformed into efficient executable code.
What a Compiler Does
A compiler reads source text and produces target code, preserving the meaning of the source program. The process is structured as a sequence of phases, each building on the output of the previous one.
Source program
-> Lexical Analysis (characters -> tokens)
-> Syntax Analysis (tokens -> parse tree)
-> Semantic Analysis (parse tree -> annotated AST)
-> IR Generation (AST -> intermediate representation)
-> Optimization (IR -> optimized IR)
-> Code Generation (IR -> target code)
-> Target program
Each phase can detect errors and report them with source location information.
Compiler vs. Interpreter
Compiler: translates the entire program before execution. The target program runs independently of the compiler. Examples: C, C++, Rust, Go.
Interpreter: reads and executes source statements one at a time (or after partial compilation). Source (or bytecode) is needed at runtime. Examples: Python (CPython), Ruby, JavaScript (without JIT).
Hybrid (bytecode + JIT): compile to bytecode (Java .class, Python .pyc); interpret bytecode at runtime; JIT-compile hot paths to native code. Examples: JVM, CPython with JIT (PyPy), V8 (JavaScript).
| Property | Compiler | Interpreter | JIT |
|---|---|---|---|
| Execution speed | Fast | Slow | Near-native |
| Startup time | Slow (compile once) | Fast | Medium |
| Portability | Platform-specific | Portable | Portable |
| Error reporting | Compile time | Runtime | Mixed |
Compiler Phases in Detail
Front end: analysis phases that are language-specific and machine-independent.
- Lexical analysis (scanning).
- Syntax analysis (parsing).
- Semantic analysis.
Middle end: language-independent, machine-independent transformations.
- IR generation.
- Optimization.
Back end: machine-specific code generation.
- Instruction selection.
- Register allocation.
- Instruction scheduling.
Symbol Table
A data structure maintained throughout compilation that maps identifiers to their attributes: type, scope, memory location.
Scope rules: the symbol table manages scopes (nested scopes in block-structured languages). When entering a block, push a new scope. When leaving, pop it.
Attributes stored: name, kind (variable, function, type), data type, scope level, memory offset or register, function signature (parameter types, return type).
Error Handling
A compiler must detect and report errors without crashing or producing incorrect code.
Lexical errors: invalid characters, unterminated strings.
Syntax errors: mismatched parentheses, missing semicolons. A good parser recovers and continues parsing (panic mode recovery, phrase-level recovery).
Semantic errors: type mismatches, undefined variables, redeclared identifiers.
The compiler should: report the error with source line and column, recover to find further errors, avoid cascading false errors.
Example Languages and Compilers
| Language | Compiler | Notes |
|---|---|---|
| C | GCC, Clang | LLVM IR backend |
| C++ | G++, Clang++ | Complex front end (templates) |
| Rust | rustc | Borrow checker in semantic analysis |
| Java | javac | Compiles to JVM bytecode |
| Swift | swiftc | LLVM backend |
| Go | gc | Fast compilation; single binary |
LLVM
The dominant modern compiler infrastructure. A reusable, modular collection of compiler and toolchain technologies.
Architecture: language-specific front ends (Clang for C/C++/ObjC; rustc; swiftc) lower source to LLVM IR. The LLVM middle end optimizes IR. Multiple back ends generate machine code (x86-64, ARM, RISC-V, WebAssembly).
LLVM IR: strongly typed, static single assignment (SSA) form, infinite virtual registers. Platform-independent; enables powerful cross-language optimization.