rtdev logo

Writing a C compiler from scratch

Writing C compiler - Part 1, Lexing

Writing C compiler - Part 2, Parsing

Writing C compiler - Part 3, Evaluation

Contents

Writing a compiler can be challenging and fun. It's fun if you simply take it up as an exercise to learn about lexing, parsing and code generation. Not so fun if you try to implement an actual general purpose language in all it's glory.

Check out this repo to see how I made a simple C compiler
https://github.com/RohitRTdev/simc

To keep this process enjoyable, we will only be building a C compiler that supports a subset of the language features. For more info on the exact features implemented, please visit the repo. With that in mind, let's get onto it.

To write any compiler, we need to do 3 main tasks

  1. Lexing
  2. Parsing
  3. Evaluation and code-generation

To understand these 3 processes, let's first see what we are even trying to do. Whenever faced with a complex task, take a simple example first and see how the solution looks for that case alone, and then work on generalizing it.

Let's consider the code:

    int a = b + c * d++ - *m;

How do we generate the code for this above statement. Before even answering this question, we need to first ask ourselves what architecture do we want to generate code for. I have decided to use the famous x64 / AMD64 architecture since it is the most widely used one in consumer PC's and laptops.

Now that we've decided the arch, let's see how the assembly for the code snippet above looks like.

x64 asm att syntax
    //d++
    movl -12(%rbp), %ebx //Load 'd'(assume it's stored at stack location rbp-12)
    addl 1, -12(%rbp) //Increment 'd'
    //c * d++
    movl -8(%rbp), %eax //Load 'c'
    imull %ebx, %eax // Signed multiply c * d++, now res0 = c * d++
    //*m
    movq -24(%rbp), %rbx //Load 'm'(We're loading into rbx since pointer is 8 bytes)
    movl (%rbx), %ebx //Dereference pointer 'm', so res1 = *m
    //sub res1 from res0
    subl %eax, %ebx // Now res2 = res0 - res1
    //b + res2
    addl (%rbp), %ebx // This gives us the final result
    //a = final_result
    movl %ebx, -28(%rbp) //Store final result at 'a'

Variables b, c and d are assumed to be of type int and m is assumed to be of type int* .

Notice a few things here. First the variable's location. In C, variables are mostly stored in the stack, and it grows downwards in x64 arch. This is not the case always though (variables could also be located in static memory or heap). Also notice the offsets at which the variables are stored. They are allocated in such a way that all the variables are aligned on their natural boundary( int variables are aligned on 4 byte boundaries and pointers are aligned on 8 byte boundaries).


Points to remember

simc doesn't implement structs or unions, but you can take that up as an exercise if you like. With that said, let's move on to first stage of compilation, lexing.

Rohit Jacob George

In case of queries or mistakes you may find, please reach out to me via my email - rohit20011002@gmail.com

© 2023 Rohit Jacob George. All Rights Reserved.