Search code examples
modulecompiler-constructionlanguage-design

implementing a module system in a programming language


I'm writing my own compiler and I'm struggling to implement a module system. Can someone guide me, how should this be done? how other languages tackle this? Also I'm trying to avoid what c and c++ do (header files). I do like the module system in Go/Golang though.

I don't know if this is relevant, but I'm using LLVM (maybe there's a magic way to import symbols).

my initial approach:

  • read and parse the entry point source file ie. main.mylang.
  • go through the imports of main.mylang
  • for each import: read, parse and resolve it's imports ...

this leads to a tree structure:

  • main.mylang: import1.mylang, import2.mylang, import3.mylang
  • import1.mylang: import4.mylang, import5.mylang
  • import2.mylang: import6.mylang

... etc.

then I would traverse each node and copy it's symbols (functions, global variables, etc.) to the parent node's symbol table. if a parent node is null, it's an entry point file and the compiler can start output object files.

why do I think that this is bad?

  1. it's very slow, even when compiling 3-5 source files
  2. it's easy to cause name collisions
  3. you have to import the entire symbol table, because the imported file's exported symbols depend on the internal ones. for example: imagine an exported function that modifies an internal global variable

Thanks in advance


Solution

  • I think your approach is really good. Compile time speed is not that important, usability is. To prevent name collisions you can use some kind of module-namespace (importname.foo() instead of just foo()) and whenever foo does not exist allow both methods. Alternatively you could insert a placeholder in the parents symbol table and whenever the user uses that name you throw a compile time error (something like ambiguous symbol). that would look like this: main.mylang

    import module1
    import module2
    int main() {}
    

    module1.mylang

    import module2
    void foo() {}
    void bar() {}
    

    module2.mylang

    import module1
    void bar() {}
    void fun() {}
    

    After finding loops, the tree would look like this:

    main
    ├──module1
    │  └──module2
    └──module2
       └──module1
    

    And a graph like this:

    main
    ├─>main()
    ├─>foo() (module1)
    ├─>bar() (defined twice, throw error when used)
    ├─>fun() (module2)
    ├─>module1<───────────┐
    │  ├─>foo() (module1) │
    │  └─>bar() (module1) │
    └─>import2<───────────┘
       ├─>bar() (module2)
       └─>fun() (module2)
    

    I don't know much about llvm, but I am pretty sure normal tables are not enough to archive this. You will at least need nested tables if not even a graph like structure like I described. Also this is not possible with classical C/C++ architecture, except if you use unique identifiers as symbols and don't let the user know (like c++ function overloading). For example you could call one function __import1_bar and the other __import2_bar and whenever the user uses bar() you look up in this graph which one he wants to call. In the main function using import1.bar() will lead you to __import1_bar (follow the graph) and import2.bar() or import1.import2.bar() will lead you to __import2_bar.

    Good Luck figuring that out. But it is certainly a interesting problem.