Back to Blog

Building a C Compiler in Python from Scratch: A Deep Dive

15 min read

Compilers are often viewed as the dark magic of computer science. You feed them human-readable text, and out pops a binary executable that a CPU understands natively. But behind the curtain, it's just a series of deterministic text transformations. Today, we're going to pull back that curtain.

In this deep dive, we won't just talk about high-level concepts. We are going to roll up our sleeves and write actual Python code to compile a minimal C program into real, executable x86-64 assembly.

Our goal? To successfully compile and run this exact C code:

int main() {
    return 42;
}

Let's break down the architecture of our compiler into three distinct, manageable phases: Lexical Analysis, Parsing, and Code Generation.

Source Code
Lexer
Parser
Generator
Assembly

Phase 1: Lexical Analysis (The Lexer)

A CPU has no concept of "words", "keywords", or "punctuation". To a computer, our C code is just a flat array of ASCII characters. The Lexer's job is to group these raw characters into meaningful chunks called Tokens.

Think of it like reading an English sentence. Before you understand the grammar, your brain first identifies the individual words.

Defining our Tokens

First, we need to define the types of tokens our language supports. In Python, a Dataclass and an Enum are perfect for this:

from enum import Enum, auto
from dataclasses import dataclass

class TokenType(Enum):
    INT_KEYWORD = auto()
    RETURN_KEYWORD = auto()
    IDENTIFIER = auto()
    INTEGER_LITERAL = auto()
    OPEN_PAREN = auto()
    CLOSE_PAREN = auto()
    OPEN_BRACE = auto()
    CLOSE_BRACE = auto()
    SEMICOLON = auto()

@dataclass
class Token:
    type: TokenType
    value: str = ""

Scanning the Source Code

Our lexer iterates through the source code string. If it sees a letter, it consumes characters until it forms a word, then checks if that word is a keyword (like int) or just a regular identifier (like main). If it sees a digit, it builds an integer literal.

class Lexer:
    def __init__(self, source_code: str):
        self.source = source_code
        self.pos = 0

    def tokenize(self) -> list[Token]:
        tokens = []
        while self.pos < len(self.source):
            char = self.source[self.pos]

            # Skip whitespace
            if char.isspace():
                self.pos += 1
                continue

            # Handle single-character punctuation
            if char == '{':
                tokens.append(Token(TokenType.OPEN_BRACE))
                self.pos += 1
            elif char == '}':
                tokens.append(Token(TokenType.CLOSE_BRACE))
                self.pos += 1
            elif char == '(':
                tokens.append(Token(TokenType.OPEN_PAREN))
                self.pos += 1
            elif char == ')':
                tokens.append(Token(TokenType.CLOSE_PAREN))
                self.pos += 1
            elif char == ';':
                tokens.append(Token(TokenType.SEMICOLON))
                self.pos += 1

            # Handle Identifiers and Keywords
            elif char.isalpha():
                buffer = ""
                while self.pos < len(self.source) and self.source[self.pos].isalnum():
                    buffer += self.source[self.pos]
                    self.pos += 1

                if buffer == "int":
                    tokens.append(Token(TokenType.INT_KEYWORD))
                elif buffer == "return":
                    tokens.append(Token(TokenType.RETURN_KEYWORD))
                else:
                    tokens.append(Token(TokenType.IDENTIFIER, buffer))

            # Handle Integer Literals
            elif char.isdigit():
                buffer = ""
                while self.pos < len(self.source) and self.source[self.pos].isdigit():
                    buffer += self.source[self.pos]
                    self.pos += 1
                tokens.append(Token(TokenType.INTEGER_LITERAL, buffer))
            else:
                raise ValueError(f"Unknown character: {char}")

        return tokens

Running our C code through this lexer transforms the raw string into a neat list of Python objects. Now, the compiler doesn't have to care about spaces or newlines anymore!


Phase 2: Parsing (Building the AST)

A list of tokens is great, but it lacks structural meaning. For example, return int 42; is a valid list of tokens, but it violates C grammar. The parser enforces grammar rules and builds an Abstract Syntax Tree (AST).

FunctionDeclaration(main)
ReturnStatement
Integer(42)

To parse C, we use a technique called Recursive Descent Parsing. We write a function for each grammar rule. Let's define the AST nodes we need for our simple program:

class ASTNode: pass

@dataclass
class Program(ASTNode):
    function: 'Function'

@dataclass
class Function(ASTNode):
    name: str
    body: 'ReturnStatement'

@dataclass
class ReturnStatement(ASTNode):
    expression: int

Writing the Parser

The parser reads tokens sequentially. When parsing a function, it strictly expects the sequence: intidentifier(){statement}. If this sequence is broken, it throws a Syntax Error.

class Parser:
    def __init__(self, tokens: list[Token]):
        self.tokens = tokens
        self.pos = 0

    def current(self) -> Token:
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, expected_type: TokenType) -> Token:
        token = self.current()
        if token and token.type == expected_type:
            self.pos += 1
            return token
        raise SyntaxError(f"Expected {expected_type}, got {token}")

    def parse_function(self) -> Function:
        self.consume(TokenType.INT_KEYWORD)
        name = self.consume(TokenType.IDENTIFIER).value
        self.consume(TokenType.OPEN_PAREN)
        self.consume(TokenType.CLOSE_PAREN)
        self.consume(TokenType.OPEN_BRACE)

        # Parse the statement inside the function
        statement = self.parse_statement()

        self.consume(TokenType.CLOSE_BRACE)
        return Function(name, statement)

    def parse_statement(self) -> ReturnStatement:
        self.consume(TokenType.RETURN_KEYWORD)
        value_token = self.consume(TokenType.INTEGER_LITERAL)
        self.consume(TokenType.SEMICOLON)
        return ReturnStatement(int(value_token.value))

Notice how beautifully the parser mirrors the grammatical rules of the C language. By calling parser.parse_function(), we convert our flat array of tokens into a nested AST object that perfectly represents the logic of our program.


Phase 3: Code Generation

Now for the most satisfying part. We have a conceptual tree representing our code; we need to turn it into x86-64 assembly instructions that the operating system can execute.

Assembly Crash Course

In x86-64 assembly, the CPU has "registers" (tiny, ultra-fast storage slots). When a function returns a value in C, the standard calling convention mandates that the return value must be placed in a specific register called %eax (or %rax for 64-bit).

So, to return 42, we need our compiler to output the equivalent of:

The Generator

Our code generator will walk the AST and emit raw assembly string commands.

class CodeGenerator:
    def generate(self, node: ASTNode) -> str:
        if isinstance(node, Program):
            return self.generate(node.function)

        elif isinstance(node, Function):
            # We must declare the function name as global so the OS can find 'main'
            assembly = f".globl {node.name}\n"
            assembly += f"{node.name}:\n"

            # Generate code for the function body
            assembly += self.generate(node.body)
            return assembly

        elif isinstance(node, ReturnStatement):
            assembly = f"    mov ${node.expression}, %eax\n"
            assembly += "    ret\n"
            return assembly

If we pass our AST into this generator, it spits out exactly the assembly code required to make an executable out of our logic!


Tying it all together

Let's wrap our three phases into a single script that reads a .c file and writes an .s assembly file.

def compile_c_to_asm(source_code: str) -> str:
    # 1. Lexing
    lexer = Lexer(source_code)
    tokens = lexer.tokenize()

    # 2. Parsing
    parser = Parser(tokens)
    ast = Program(parser.parse_function())

    # 3. Code Generation
    generator = CodeGenerator()
    return generator.generate(ast)

# Run the compiler!
source = "int main() { return 42; }"
asm_code = compile_c_to_asm(source)

with open("out.s", "w") as f:
    f.write(asm_code)

From Assembly to Executable

Our Python script successfully creates out.s. The final step is to use the standard GNU tools (like GCC) to assemble and link this file into a binary executable:

# Assemble and link
$ gcc out.s -o my_program

# Run our newly compiled program
$ ./my_program

# Check the exit code (should be 42)
$ echo $?
42

Conclusion

And there you have it! We went from a raw string of C code to a fully functioning executable binary using nothing but Python string manipulation, object-oriented structuring, and a tiny bit of assembly knowledge.

Obviously, real compilers like GCC or Clang are vastly more complex. They handle variable scopes, pointers, memory allocation, arithmetic expression trees (like 2 + 3 * 4), and heavy optimizations. However, the fundamental architecture—Lexer → Parser → AST → Generator—remains exactly the same.

Writing a compiler demystifies the magic of programming. The next time your compiler screams at you with a Syntax Error, you'll know exactly what's happening under the hood: a recursive descent parser just failed a consume() check!