assembly, common-lisp, smalltalk, incremental-compiler

Cannot understand how incremental assemblers/compilers/image-based systems work


I am learning a little assembly and for my next project I would like to learn how to make an incremental assembler. By "incremental assembler" I mean an assembler that accepts new code while it's running.

Generally, the workflow with assembly is that you write your files and feed them to the assembler+linker and get an executable at the other end. Contrast this with image-based systems such as Smalltalk or SBCL (lisp), where you have a running image, to which functions/expressions are added incrementally.

Does anyone know how that is accomplished in such systems? Assuming we are working on Linux, do they simply edit the ELF file and reload the whole image every time a new function/expression is executed? Or is there a way to load the content of the ELF file and then execute assembly on top of it on the fly (i.e. without writing any additional files to disk)?

Does anyone have a minimal example they can point me to? Or books/blogs about such image-based systems and how they are made?


Solution

  • For an image-based system like most Lisps, the answer to this is simple (but sometimes fiddly).

    The compiler / assembler takes source code, and its end result, like the end result of any compiler or assembler, is one or more arrays of octets representing the resulting object code, perhaps some data associated with it, as well as information about the names referred to by the code, the names defined by the code, relocation information, and so on.

    In a traditional system, those arrays are then laboriously written into a file (long ago, when machines had almost no memory, it was probably necessary to write them to files as they were created), and then some program is invoked which glues several of those files together into a single file, patching up references and so on. That resulting file is then loaded into the memory of the machine, yet more patching up is done, and finally the machine is told to run it. And the program instantly crashes, and the whole process needs to be done again. (I have glossed over many details here.)

    And there then needs to be some kind of protocol, in the form of one or more standard file formats, which allows all these different tools to drag the data to and from memory as many times as they need to. ELF is one such standard: there have been dozens of others.

    In an image-based system none of that bureaucracy is needed: what happens is that the compiler / assembler produces an array of octets of some kind as before, as well as some representations of data. All this data simply lives in memory, and most of the patching up of that array is probably done as it is created. That array is now executable code, so all that needs to happen in principle is that the machine needs to be told 'start running this'. In practice, on a modern machine, more needs to be done: the memory it is in needs to be marked as executable, and probably a little dance needs to happen because memory marked as executable can't be written, and so on.
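
    To make that little dance concrete, here is a minimal sketch of doing it by hand from Common Lisp through CFFI instead of letting the compiler do it: map a writable page, copy raw machine code into it, flip the page to executable with mprotect, and jump into it. Everything in it is an assumption made for illustration: it presumes Linux on x86-64 with CFFI loaded (for example via (ql:quickload "cffi")), and it omits all error checking.

    ;; mov eax, 42 ; ret  -- six octets of raw x86-64 machine code
    (defparameter *octets* #(#xB8 #x2A #x00 #x00 #x00 #xC3))

    (defun run-raw-code (octets)
      ;; 1. Ask the OS for a writable page: PROT_READ|PROT_WRITE = 3,
      ;;    MAP_PRIVATE|MAP_ANONYMOUS = #x22 on Linux.
      (let ((page (cffi:foreign-funcall "mmap"
                                        :pointer (cffi:null-pointer)
                                        :unsigned-long 4096
                                        :int 3 :int #x22 :int -1 :long 0
                                        :pointer)))
        ;; 2. Copy the octets into it while it is still writable.
        (loop for byte across octets
              for i from 0
              do (setf (cffi:mem-aref page :unsigned-char i) byte))
        ;; 3. The dance: make the page executable (PROT_READ|PROT_EXEC = 5) ...
        (cffi:foreign-funcall "mprotect"
                              :pointer page :unsigned-long 4096 :int 5 :int)
        ;; 4. ... and tell the machine to start running it.
        (cffi:foreign-funcall-pointer page () :int)))

    > (run-raw-code *octets*)
    42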

    You can see the compiler doing all of this at work:

    > (defun foo (x y)
        (+ x y))
    foo
    
    > (compile 'foo)
    foo
    nil
    nil
    
    > (describe (function foo))
    
    #<Function foo 80200109F4> is a function
    code           #<code foo (76) 80200109C0>
    constants      (0 #<Function foo 80200109F4> foo #(#(1 0) 0) (x y))
    

    So the foo function (the thing the compiler produced) has two components: its code, which is an object wrapping the array of octets that the machine will execute, and a vector of constants which that code refers to. In fact, in the implementation I'm using (LispWorks), there are a couple of functions to ask things about the function's code:

    > (system:function-code-length #'foo)
    76
    

    It's 76 octets long, and if you (disassemble 'foo) you will see that this is indeed the length of the code:

    > (disassemble 'foo)
    [...]
          75:      90               nop
    

    You can find its address in memory:

    > (system:function-code-address #'foo)
    550292752884
    

    And you can see that this address can change when the GC relocates it:

    > (clean-down)
    51183616
    
    > (system:function-code-address #'foo)
    559151419204
    

    (clean-down in LW does a fairly big GC: it 'frees memory and reduces the size of the image, if possible'.)
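
    And despite the move, foo still works exactly as before; the GC has fixed up everything that refers to the relocated code (continuing the same LispWorks session as above):

    > (foo 1 2)
    3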

    In summary: what an incremental, image-based compiler / assembler does is the same as what a file-based compiler / assembler does ... except without writing the data into a file, copying it into another file, and then reading that final file back into memory and without the conspiracy of file formats needed to do that. It just relies on the fact that the compiled code is already in memory and runs it there.
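
    Finally, to connect this back to the question of an assembler or compiler that "accepts new code while it's running": in Common Lisp the whole incremental loop is available at run time, using only standard functions. Here is a portable sketch; the name add-definition and the example definition it is given are made up for illustration.

    > (defun add-definition (source-string)
        ;; READ turns the text into a form, EVAL of the DEFUN installs it
        ;; in the running image, and COMPILE replaces that definition with
        ;; machine code living in memory, exactly as described above.
        (let ((name (eval (read-from-string source-string))))
          (compile name)))
    add-definition

    > (add-definition "(defun bar (x) (* x x))")
    bar
    nil
    nil

    > (bar 7)
    49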