Compilers are often viewed as the dark magic of computer science. You feed them human-readable text, and out pops a binary executable that a CPU understands natively. But behind the curtain, it's just a series of deterministic text transformations. Today, we're going to pull back that curtain.
In this deep dive, we won't just talk about high-level concepts. We are going to roll up our sleeves and write actual Python code to compile a minimal C program into real, executable x86-64 assembly.
Our goal? To successfully compile and run this exact C code:
int main() {
return 42;
}Let's break down the architecture of our compiler into three distinct, manageable phases: Lexical Analysis, Parsing, and Code Generation.
Phase 1: Lexical Analysis (The Lexer)
A CPU has no concept of "words", "keywords", or "punctuation". To a computer, our C code is just a flat array of ASCII characters. The Lexer's job is to group these raw characters into meaningful chunks called Tokens.
Think of it like reading an English sentence. Before you understand the grammar, your brain first identifies the individual words.
Defining our Tokens
First, we need to define the types of tokens our language supports. In Python, a Dataclass and an Enum are perfect for this:
from enum import Enum, auto
from dataclasses import dataclass
class TokenType(Enum):
INT_KEYWORD = auto()
RETURN_KEYWORD = auto()
IDENTIFIER = auto()
INTEGER_LITERAL = auto()
OPEN_PAREN = auto()
CLOSE_PAREN = auto()
OPEN_BRACE = auto()
CLOSE_BRACE = auto()
SEMICOLON = auto()
@dataclass
class Token:
type: TokenType
value: str = ""Scanning the Source Code
Our lexer iterates through the source code string. If it sees a letter, it consumes characters until it forms a word, then checks if that word is a keyword (like int) or just a regular identifier (like main). If it sees a digit, it builds an integer literal.
class Lexer:
def __init__(self, source_code: str):
self.source = source_code
self.pos = 0
def tokenize(self) -> list[Token]:
tokens = []
while self.pos < len(self.source):
char = self.source[self.pos]
# Skip whitespace
if char.isspace():
self.pos += 1
continue
# Handle single-character punctuation
if char == '{':
tokens.append(Token(TokenType.OPEN_BRACE))
self.pos += 1
elif char == '}':
tokens.append(Token(TokenType.CLOSE_BRACE))
self.pos += 1
elif char == '(':
tokens.append(Token(TokenType.OPEN_PAREN))
self.pos += 1
elif char == ')':
tokens.append(Token(TokenType.CLOSE_PAREN))
self.pos += 1
elif char == ';':
tokens.append(Token(TokenType.SEMICOLON))
self.pos += 1
# Handle Identifiers and Keywords
elif char.isalpha():
buffer = ""
while self.pos < len(self.source) and self.source[self.pos].isalnum():
buffer += self.source[self.pos]
self.pos += 1
if buffer == "int":
tokens.append(Token(TokenType.INT_KEYWORD))
elif buffer == "return":
tokens.append(Token(TokenType.RETURN_KEYWORD))
else:
tokens.append(Token(TokenType.IDENTIFIER, buffer))
# Handle Integer Literals
elif char.isdigit():
buffer = ""
while self.pos < len(self.source) and self.source[self.pos].isdigit():
buffer += self.source[self.pos]
self.pos += 1
tokens.append(Token(TokenType.INTEGER_LITERAL, buffer))
else:
raise ValueError(f"Unknown character: {char}")
return tokensRunning our C code through this lexer transforms the raw string into a neat list of Python objects. Now, the compiler doesn't have to care about spaces or newlines anymore!
Phase 2: Parsing (Building the AST)
A list of tokens is great, but it lacks structural meaning. For example, return int 42; is a valid list of tokens, but it violates C grammar. The parser enforces grammar rules and builds an Abstract Syntax Tree (AST).
To parse C, we use a technique called Recursive Descent Parsing. We write a function for each grammar rule. Let's define the AST nodes we need for our simple program:
class ASTNode: pass
@dataclass
class Program(ASTNode):
function: 'Function'
@dataclass
class Function(ASTNode):
name: str
body: 'ReturnStatement'
@dataclass
class ReturnStatement(ASTNode):
expression: intWriting the Parser
The parser reads tokens sequentially. When parsing a function, it strictly expects the sequence: int → identifier → ( → ) → { → statement → }. If this sequence is broken, it throws a Syntax Error.
class Parser:
def __init__(self, tokens: list[Token]):
self.tokens = tokens
self.pos = 0
def current(self) -> Token:
return self.tokens[self.pos] if self.pos < len(self.tokens) else None
def consume(self, expected_type: TokenType) -> Token:
token = self.current()
if token and token.type == expected_type:
self.pos += 1
return token
raise SyntaxError(f"Expected {expected_type}, got {token}")
def parse_function(self) -> Function:
self.consume(TokenType.INT_KEYWORD)
name = self.consume(TokenType.IDENTIFIER).value
self.consume(TokenType.OPEN_PAREN)
self.consume(TokenType.CLOSE_PAREN)
self.consume(TokenType.OPEN_BRACE)
# Parse the statement inside the function
statement = self.parse_statement()
self.consume(TokenType.CLOSE_BRACE)
return Function(name, statement)
def parse_statement(self) -> ReturnStatement:
self.consume(TokenType.RETURN_KEYWORD)
value_token = self.consume(TokenType.INTEGER_LITERAL)
self.consume(TokenType.SEMICOLON)
return ReturnStatement(int(value_token.value))Notice how beautifully the parser mirrors the grammatical rules of the C language. By calling parser.parse_function(), we convert our flat array of tokens into a nested AST object that perfectly represents the logic of our program.
Phase 3: Code Generation
Now for the most satisfying part. We have a conceptual tree representing our code; we need to turn it into x86-64 assembly instructions that the operating system can execute.
Assembly Crash Course
In x86-64 assembly, the CPU has "registers" (tiny, ultra-fast storage slots). When a function returns a value in C, the standard calling convention mandates that the return value must be placed in a specific register called %eax (or %rax for 64-bit).
So, to return 42, we need our compiler to output the equivalent of:
mov $42, %eax(Move the number 42 into the EAX register)ret(Return from the function)
The Generator
Our code generator will walk the AST and emit raw assembly string commands.
class CodeGenerator:
def generate(self, node: ASTNode) -> str:
if isinstance(node, Program):
return self.generate(node.function)
elif isinstance(node, Function):
# We must declare the function name as global so the OS can find 'main'
assembly = f".globl {node.name}\n"
assembly += f"{node.name}:\n"
# Generate code for the function body
assembly += self.generate(node.body)
return assembly
elif isinstance(node, ReturnStatement):
assembly = f" mov ${node.expression}, %eax\n"
assembly += " ret\n"
return assemblyIf we pass our AST into this generator, it spits out exactly the assembly code required to make an executable out of our logic!
Tying it all together
Let's wrap our three phases into a single script that reads a .c file and writes an .s assembly file.
def compile_c_to_asm(source_code: str) -> str:
# 1. Lexing
lexer = Lexer(source_code)
tokens = lexer.tokenize()
# 2. Parsing
parser = Parser(tokens)
ast = Program(parser.parse_function())
# 3. Code Generation
generator = CodeGenerator()
return generator.generate(ast)
# Run the compiler!
source = "int main() { return 42; }"
asm_code = compile_c_to_asm(source)
with open("out.s", "w") as f:
f.write(asm_code)From Assembly to Executable
Our Python script successfully creates out.s. The final step is to use the standard GNU tools (like GCC) to assemble and link this file into a binary executable:
# Assemble and link
$ gcc out.s -o my_program
# Run our newly compiled program
$ ./my_program
# Check the exit code (should be 42)
$ echo $?
42Conclusion
And there you have it! We went from a raw string of C code to a fully functioning executable binary using nothing but Python string manipulation, object-oriented structuring, and a tiny bit of assembly knowledge.
Obviously, real compilers like GCC or Clang are vastly more complex. They handle variable scopes, pointers, memory allocation, arithmetic expression trees (like 2 + 3 * 4), and heavy optimizations. However, the fundamental architecture—Lexer → Parser → AST → Generator—remains exactly the same.
Writing a compiler demystifies the magic of programming. The next time your compiler screams at you with a Syntax Error, you'll know exactly what's happening under the hood: a recursive descent parser just failed a consume() check!