Compiler Architecture Explained

The Glu compiler is a multi-stage compiler that translates Glu source code into an executable file. It is written in C++ and uses the LLVM compiler infrastructure to generate machine code. Each stage performs a specific task in the compilation process.

What sets the Glu compiler apart is that it is designed to be reversible: it can generate Glu source code from the LLVM IR representation of a program. This lets the compiler optimize the LLVM IR and then generate Glu source code from the optimized IR, which is useful for debugging and for understanding the behavior of the compiler. Most importantly, it allows the compiler to generate human-readable Glu source code from the optimized IR produced by any LLVM-based compiler, such as Clang, rustc, or swiftc.

Compiler Stages

Each compilation stage has a corresponding reverse stage that goes backwards in the compilation process. The stages are shown in this diagram:

[Diagram: Glu Compiler Stages]

The code goes through the following representations during the compilation process:

  • Glu Source Code: The original source code written in the Glu programming language (.glu files).
  • AST: The abstract syntax tree representation of the program.
  • GIL: The Glu Intermediate Language, an intermediate representation of the program that is used for high-level optimization and analysis (.gil files).
  • LLVM IR: The intermediate representation of the program in the LLVM compiler infrastructure (.ll files for textual representation, .bc files for binary representation).
  • MIR: The LLVM Machine IR, the last, target-specific representation of the program, from which machine code is generated.
  • Object File: The compiled machine code in an object file format (.o on Unix-like systems, .obj on Windows).

The Glu compilation process consists of the following stages:

  • ASTGen: Glu Source Code -> AST: The ASTGen stage parses the Glu source code and generates an abstract syntax tree (AST) representation of the program.
  • GILGen: AST -> GIL: The GILGen stage lowers the AST representation of the program to the Glu Intermediate Language (GIL) representation.
  • GILOptimize: GIL -> GIL: The GIL representation of the program is optimized using high-level optimizations.
  • IRGen: GIL -> LLVM IR: The IRGen stage generates the LLVM IR representation of the program from the GIL representation.

The LLVM infrastructure is used to optimize and generate machine code from the LLVM IR representation of the program, leveraging the powerful optimization passes and code generation capabilities of LLVM.

The Glu decompilation process is similar to the compilation process but in reverse:

  • IRDec: LLVM IR -> GIL: The IRDec stage generates a GIL representation of the program from the LLVM IR representation.
  • GILDec: GIL -> AST: The GILDec stage generates an AST representation of the program from the GIL representation.
  • ASTPrinter: AST -> Glu Source Code: The ASTPrinter stage pretty prints Glu source code from the AST representation of the program.
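
The forward and reverse stages above form a small graph over the representations, so the stages needed for any conversion can be derived mechanically. The following is a minimal sketch of that idea, not the actual gluc implementation: the `STAGES` table is transcribed from the two lists above, while the `plan` function and the breadth-first search are illustrative assumptions. GILOptimize is left out because it maps GIL to GIL and never changes the representation.

```python
from collections import deque

# Stage table transcribed from the lists above: name -> (input, output).
# GILOptimize (GIL -> GIL) is omitted because it does not change representation.
STAGES = {
    "ASTGen": ("Glu Source Code", "AST"),
    "GILGen": ("AST", "GIL"),
    "IRGen": ("GIL", "LLVM IR"),
    "IRDec": ("LLVM IR", "GIL"),
    "GILDec": ("GIL", "AST"),
    "ASTPrinter": ("AST", "Glu Source Code"),
}


def plan(source, target):
    """Return the shortest list of stage names leading from source to target.

    Hypothetical helper: a breadth-first search over the stage graph.
    """
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        rep, path = queue.popleft()
        if rep == target:
            return path
        for name, (src, dst) in STAGES.items():
            if src == rep and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [name]))
    raise ValueError(f"no stage path from {source!r} to {target!r}")


print(plan("Glu Source Code", "LLVM IR"))  # ['ASTGen', 'GILGen', 'IRGen']
print(plan("LLVM IR", "Glu Source Code"))  # ['IRDec', 'GILDec', 'ASTPrinter']
```

The direction of travel falls out of the input and output representations alone, which is why the same compiler can both compile and decompile.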

The Glu compiler is designed to be modular and extensible: each stage is implemented as a separate C++ library, so developers can easily add optimization passes to the compilation process and experiment with new optimizations and transformations without recompiling the entire compiler.
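
To illustrate what adding an optimization pass can look like, here is a toy sketch of a GIL-to-GIL pass pipeline. Everything in it is hypothetical: the real compiler is written in C++, and the names `remove_after_return` and `run_pipeline`, as well as the use of plain strings for instructions, are assumptions made only for this example.

```python
def remove_after_return(body):
    """Toy pass: drop instructions that follow the first return."""
    for index, instruction in enumerate(body):
        if instruction.startswith("return"):
            return body[: index + 1]
    return list(body)


def run_pipeline(body, passes):
    """Apply each pass in order; every pass maps a function body to a new body."""
    for gil_pass in passes:
        body = gil_pass(body)
    return body


# Adding a pass means appending one more function to this list.
pipeline = [remove_after_return]

body = ["%0 = void", "return %0", "%1 = void"]
optimized = run_pipeline(body, pipeline)
print(optimized)  # ['%0 = void', 'return %0']
```

Because each pass has the same shape (body in, body out), passes can be developed and tested in isolation and then composed in any order.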

Debug Information

The Glu compiler keeps track of debug information throughout the compilation process to provide accurate source-level debugging information in the generated machine code. The compiler generates debug information in the LLVM IR representation of the program, which is used by the LLVM infrastructure to generate debug information in the final machine code. Debug information includes source file names, line numbers, and variable names, allowing developers to debug their programs using debuggers such as LLDB.

We believe that debugging is important, both with and without optimizations. Therefore, the Glu compiler generates debug information by default, even when optimizations are enabled. This allows developers to debug optimized code as accurately as possible.

It is also possible to generate debug information for the Glu Intermediate Language (GIL) representation of the program, using -ggil. In that case, temporary GIL files are generated, and debuggers step through those files instead of the source files. This can be useful when debugging high-level optimizations that are performed on the GIL representation of the program.

[Diagram: Glu Compiler Stages when using -ggil]

It is also possible to generate debug information for decompiled Glu source code, using -gdec, which can be a better experience when debugging optimized code. The compiler will go through all compilation stages, then decompile the optimized LLVM IR back to Glu source code, and generate debug information referencing the decompiled source code. This allows stepping through the decompiled source code in the debugger, providing a more natural debugging experience, though the code may not be exactly the same as the original source code.

[Diagram: Glu Compiler Stages when using -gdec]

Finally, it is possible to disable debug information generation using -g0, which can reduce the size of the generated object files and make the compilation process faster. However, this is not recommended for development, as it makes debugging more difficult.
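
The debug modes described in this section can be summarized as a small lookup table. The sketch below only restates the text: the flag names come from this post, while the `stepping_target` helper and its rule that the first recognized flag wins are assumptions, since the post does not say how gluc resolves conflicting flags.

```python
# What a debugger steps through in each mode, as described in this section.
DEFAULT_TARGET = "original Glu source files"
DEBUG_MODES = {
    "-ggil": "temporary GIL files",
    "-gdec": "decompiled Glu source code",
    "-g0": None,  # no debug information is generated
}


def stepping_target(flags):
    """Return what a debugger would step through for a list of gluc flags.

    Assumption: the first recognized debug flag wins.
    """
    for flag in flags:
        if flag in DEBUG_MODES:
            return DEBUG_MODES[flag]
    return DEFAULT_TARGET


print(stepping_target([]))         # original Glu source files
print(stepping_target(["-gdec"]))  # decompiled Glu source code
```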

Debug information such as variables and their types is very important in the decompilation process, as it allows the compiler to generate better Glu source code from the optimized LLVM IR representation of the program. Without debug information, the decompiled code would be less readable and harder to understand. Therefore, when importing LLVM bitcode generated by other compilers, it is recommended to enable generation of debug information in those compilers, to improve the quality of the decompiled code.

Example Compilation

To compile a Glu program, use gluc, the front end of the Glu compiler. It takes Glu source code as input and generates the requested output, such as LLVM bitcode or an executable file.

We will use the following Glu program as an example:

func main() {
    let message: String = "Hello, World!";
    std::print(message);
}

To view the first stage of the compilation process, you can generate the AST representation of the program using the -dump-ast flag:

gluc -dump-ast main.glu

This will output the AST representation of the program in a Lisp-like format. You can inspect the AST to understand how the Glu compiler represents the program internally.

It should look like this:

(FunctionDecl "main" type="() -> Void"
  (CompoundStmt
    (VarDecl let "message" type="String"
      (StringLiteral "Hello, World!")
    )
    (CallExpr type="Void"
      (ReferenceExpr "std::print")
      (ReferenceExpr "message")
    )
  )
)
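
Because the dump is Lisp-like, it is easy to load into a script for inspection. The following is a rough sketch that assumes the dump always has the shape shown above (parenthesized nodes, quoted strings, and key="value" attributes). The `parse` and `kinds` helpers are hypothetical and not part of gluc.

```python
import re

# One token per parenthesis, key="value" attribute, quoted string, or bare word.
TOKEN = re.compile(r'\(|\)|\w+="[^"]*"|"[^"]*"|[^\s()"]+')

DUMP = '''
(FunctionDecl "main" type="() -> Void"
  (CompoundStmt
    (VarDecl let "message" type="String"
      (StringLiteral "Hello, World!")
    )
    (CallExpr type="Void"
      (ReferenceExpr "std::print")
      (ReferenceExpr "message")
    )
  )
)
'''


def parse(text):
    """Parse one Lisp-like node into a (kind, attributes, children) tuple."""
    tokens = TOKEN.findall(text)
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        kind = tokens[pos + 1]
        pos += 2
        attrs, children = [], []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                attrs.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing parenthesis
        return kind, attrs, children

    try:
        tree = node()
    except IndexError:
        raise ValueError("unbalanced parentheses") from None
    if pos != len(tokens):
        raise ValueError("unexpected text after the top-level node")
    return tree


def kinds(tree):
    """List node kinds in pre-order."""
    kind, _, children = tree
    return [kind] + [k for child in children for k in kinds(child)]


print(kinds(parse(DUMP)))
```

The pre-order list of node kinds gives a quick overview of the tree, and the parser rejects a dump whose parentheses do not balance.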

You can then look at the GIL representation of the program using the -emit-gil flag. Use the -o flag to specify the output file, or pass - to write to standard output:

gluc -emit-gil main.glu -o -

This will output the GIL representation of the program in a textual format. You can inspect the GIL to understand how the Glu compiler lowers the AST representation of the program to the GIL representation.

It should look like this:

import std;

@gil
@location("main.glu":1:1)
func main() {
    %0 = string_literal "Hello, World!", location "main.glu":2:27
    debug_value %0: String, let `message`, type String, location "main.glu":2:9
    %1 = func_ref std::print: (String) -> Void
    %2 = call %1(%0)
    %3 = void
    return %3
}
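
The location annotations in this output are how source positions travel with each instruction. As a small illustration, the sketch below pulls them out of textual GIL with a regular expression. It assumes the annotation syntax is exactly as printed above and is not based on a GIL specification. The `source_locations` helper is hypothetical.

```python
import re

# Matches both `location "file":line:col` and `@location("file":line:col)`.
LOCATION = re.compile(r'location\s*\(?"([^"]+)":(\d+):(\d+)')

SAMPLE = '''@location("main.glu":1:1)
func main() {
    %0 = string_literal "Hello, World!", location "main.glu":2:27
    debug_value %0: String, let `message`, type String, location "main.glu":2:9
}
'''


def source_locations(gil_text):
    """Return (file, line, column) for every GIL line that carries a location."""
    found = []
    for line in gil_text.splitlines():
        match = LOCATION.search(line)
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found


print(source_locations(SAMPLE))
# [('main.glu', 1, 1), ('main.glu', 2, 27), ('main.glu', 2, 9)]
```

Each tuple points back into main.glu, which is the information later turned into debug information in the LLVM IR.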

You can then generate the LLVM IR representation of the program using the -emit-ll flag:

gluc -emit-ll main.glu -o main.ll
less main.ll

This will output the LLVM IR representation of the program in a textual format. It can be quite verbose, but you can inspect the LLVM IR to understand how the Glu compiler generates LLVM IR from the GIL representation of the program.

Decompilation

You can then decompile the LLVM IR back to GIL representation using the -emit-gil flag:

gluc -emit-gil main.ll -o -

Because the input is LLVM IR, the compiler will go through the IRDec stage to generate the GIL representation of the program. You can inspect the GIL to understand how the Glu compiler decompiles LLVM IR back to GIL representation.

The resulting GIL should be similar to the original GIL representation of the program.

Finally, you can decompile the LLVM IR all the way back to Glu source code using the -emit-glu flag:

gluc -emit-glu main.ll -o -

This will go through the IRDec, GILDec, and ASTPrinter stages to generate human-readable Glu source code from the LLVM IR representation of the program. You can inspect the decompiled Glu source code to understand how the Glu compiler generates Glu source code from the LLVM IR representation of the program.

For this simple program, the decompiled Glu source code should be identical to the original Glu source code. However, for more complex programs with optimizations applied, the decompiled code may differ from the original source code.
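
One way to check this round-trip property is to diff the original source against the decompiled output. The sketch below uses Python's standard difflib module. The `roundtrip_diff` helper is hypothetical, and the decompiled text is assumed to have been captured beforehand from the output of the -emit-glu command shown above.

```python
import difflib


def roundtrip_diff(original, decompiled):
    """Return unified-diff lines between two sources, ignoring trailing whitespace."""
    a = [line.rstrip() for line in original.strip().splitlines()]
    b = [line.rstrip() for line in decompiled.strip().splitlines()]
    return list(difflib.unified_diff(a, b, "original", "decompiled", lineterm=""))


original = '''func main() {
    let message: String = "Hello, World!";
    std::print(message);
}
'''

# An identical round trip yields an empty diff.
print(roundtrip_diff(original, original))  # []

# A changed line shows up as a removed/added pair in the diff.
changed = original.replace("message", "msg")
for line in roundtrip_diff(original, changed):
    print(line)
```

An empty result means the decompiled code matches the original up to trailing whitespace, while a non-empty diff shows exactly where optimization or decompilation changed the program text.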

This example demonstrates the various stages of the Glu compilation process and how you can inspect the intermediate representations of the program at each stage. Understanding the compilation process can help you debug and optimize your Glu programs effectively.

Decompiling LLVM IR from Other Compilers

Remember that you can also decompile LLVM IR generated by other compilers back to Glu source code using the Glu compiler. This can be useful for understanding how other compilers optimize and generate code, and for integrating Glu code with code generated by other compilers.

For example, consider a similar C program:

#include <stdio.h>

int main() {
    char const *message = "Hello, World!";
    puts(message);
    return 0;
}

It can be compiled to LLVM bitcode with Clang. The -c flag stops before linking, which -emit-llvm requires:

clang -c -emit-llvm main.c -o main.bc

And then decompiled back to Glu source code:

gluc -emit-glu main.bc -o -

This should generate Glu source code similar to the original Glu program, demonstrating the reversibility of the Glu compiler and its ability to generate Glu source code from LLVM IR generated by other compilers.

This post is licensed under CC BY 4.0 by the author.