Lexical Analysis
Lexical analysis (scanning) is the first phase of compilation. It reads the raw character stream of the source program and produces a sequence of tokens, which are the atomic units of meaning in the language.
Tokens
A token is a pair (token class, attribute value).
Token classes: keyword (if, while, return), identifier, integer literal, float literal, string literal, operator (+, <=, ==), delimiter (;, {, (), whitespace (usually discarded), comment (usually discarded).
Examples:
x = y + 3.14;
| Lexeme | Token class | Attribute |
|---|---|---|
x |
ID | “x” |
= |
ASSIGN | |
y |
ID | “y” |
+ |
PLUS | |
3.14 |
FLOAT_LIT | 3.14 |
; |
SEMICOLON |
The parser works with token classes; attribute values are stored in the symbol table or AST.
Regular Expressions
Token patterns are described by regular expressions (regex).
Operations:
- Concatenation:
abmatches “ab”. - Union:
a|bmatches “a” or “b”. - Kleene star:
a*matches “”, “a”, “aa”, … - One or more:
a+=aa*. - Optional:
a?=a|ε. - Character class:
[a-z]matches any lowercase letter.
Examples:
- Integer literal:
[0-9]+ - Identifier:
[a-zA-Z_][a-zA-Z0-9_]* - Float literal:
[0-9]+\.[0-9]+([eE][+-]?[0-9]+)? - C-style comment:
/\*([^*]|\*+[^*/])*\*+/
Finite Automata
Regex patterns are recognized by finite automata (FA).
NFA (Nondeterministic Finite Automaton): states connected by labeled transitions; multiple transitions on the same input and epsilon (empty) transitions are allowed. Easy to construct from regex (Thompson’s construction).
DFA (Deterministic Finite Automaton): exactly one transition per state per input symbol; no epsilon transitions. Fast to simulate ($O(n)$ for an $n$-character input).
NFA to DFA (subset construction): each DFA state is a set of NFA states. The DFA state for a set $S$ has a transition on input $a$ to the set of all NFA states reachable from any state in $S$ via $a$ and then epsilon transitions.
DFA minimization: merge equivalent states (states that behave identically for all inputs). Reduces DFA size; Hopcroft’s algorithm runs in $O(n \log n)$.
Lexer Construction
Thompson’s construction: given a regex, build an NFA.
- Base case: single character $a$ becomes a two-state NFA.
- Concatenation: connect the accept state of NFA1 to the start state of NFA2 via epsilon.
- Union: new start state with epsilon transitions to both NFA starts.
- Kleene star: add epsilon back-loop and bypass.
Pipeline:
Regex -> NFA (Thompson) -> DFA (subset construction) -> min DFA -> lexer code
Longest match rule: the lexer takes the longest possible match at each position. >= is one token (GREQ), not > followed by =.
Keyword priority: when the identifier pattern and a keyword pattern both match (e.g., if), keywords take priority. Implemented by checking the identifier value against a keyword table.
Lexer Generators
Tools that generate lexers automatically from regex specifications.
lex / flex: C-based lexer generator. Specify token patterns with actions in a .l file; flex generates a C function yylex().
[0-9]+ { yylval.ival = atoi(yytext); return INT; }
[a-zA-Z_]+ { return lookup_keyword(yytext); }
[ \t\n]+ { /* skip whitespace */ }
ANTLR: generates lexers and parsers in multiple languages (Java, Python, C#).
Hand-written lexers: used in production compilers (GCC, Clang, Rust) for speed and better error messages. Implemented as a switch on the current character with manual state tracking.
Handling Lexer Complexity
String literals: accumulate characters between quotes; handle escape sequences (\n, \\, \").
Multi-line comments: /* ... */. Requires tracking nesting depth (C does not allow nested; some languages do) and line numbers for error messages.
Line and column tracking: increment line counter on \n; reset column. Store position at token start for error reporting.
Lookahead: some tokens require looking one character ahead. > vs >=: read >; peek at next character; emit GREQ if =, else emit GT and return the peeked character to the input.