Tags: floating-point, common-lisp, inline, sbcl, multiple-value

Efficiently computing multiple results


Context

I am currently optimizing a library for scientific computing. I'm fairly new to Common Lisp. The functions I use are quite small and execute in approx. 10 ns to a few hundred nanoseconds on an average laptop. Performance is already quite close to C, but I want every bit of speed I can get.

I use SBCL with its compiler notes and the (time) macro to optimize (any general advice is welcome). I'm currently optimizing a single-threaded chain of computations that will later be run in independent threads.

Problem

For the sake of argument, let's say I have a function (foo) that adds two lists of 3 variables term by term. Once optimized, it could be something like:

(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (list (+ a1 b1) (+ a2 b2) (+ a3 b3)))

And I time it with:

(time (dotimes (it 1000000 t) (foo 1.0d0 2.0d0 2.351d0 223.0d0 124.2d0 321d0)))

SBCL notes

From what I gather, the compiler complains that boxing the float results to put them in a list is expensive:

note: doing float to pointer coercion (cost 13)

What I'd like

The complaints of SBCL seem sensible, so I'm looking for a way to do away with those pesky lists, which I have to peel back again anyway at some point later in order to feed them to other calculations. I'm willing to go low-level for this and lose (some) abstraction. A use case could look like:

(let ((res1 0d0)
      (res2 0d0)
      (res3 0d0))
 (declare (type double-float res1 res2 res3))
 (magic-abstraction-I-want res1 res2 res3 (foo 1d0 1d0 1d0 1d0 1d0 1d0)))

With this, I could chain computations with minimal or nonexistent overhead, doing only the computations that are needed and spending no time creating lists or accessing them.

What I tried/Think of trying

Inline

I'm seeing huge performance improvements on simple functions such as foo when adding:

(declaim (inline foo))

From what I understand, it sort of "expands" the function and writes its code inline at the call site. Is that right? What does it do exactly? Does this in fact do what I want? Does it solve the problem of boxing results into lists in some way?
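For reference, the usual pattern (and the one SBCL expects) is to place the (declaim (inline foo)) before the defun, because SBCL has to save the function's source at the point where callers are compiled. This is just the foo from above with the declamation in place:

```lisp
;; The inline declamation must be in effect when FOO is defined,
;; so that SBCL saves its source for later expansion at call sites.
(declaim (inline foo))

(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (list (+ a1 b1) (+ a2 b2) (+ a3 b3)))
```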

(Also, feel free to give general speed optimization advice if you see from what I write that there is something that I might have misunderstood)

Edit: After learning about values

I modified foo, it is now :

(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           (type double-float a1 a2 a3 b1 b2 b3))
  (values (+ a1 b1) (+ a2 b2) (+ a3 b3)))

SBCL still outputs three notes about coercing return values to pointers. And I'm still consing bytes, as measured with:

(time (dotimes (it 1000000 t) (foo 1.0d0 2.0d0 2.351d0 223.0d0 124.2d0 321d0)))

However, the calls with inline are much faster and do not cons anything (as expected, I guess):

(declaim (inline foo))
;;;;
(let ((r1 0d0) (r2 0d0) (r3 0d0)
      (a1 1d0) (a2 2d0) (a3 3d0)
      (b1 4d0) (b2 5d0) (b3 6d0))
  (declare (optimize (speed 3))
           (type double-float r1 r2 r3 a1 a2 a3 b1 b2 b3))
  (time (dotimes (it 1000000 t) (setf (values r1 r2 r3) (foo a1 a2 a3 b1 b2 b3)))))

The performance is exactly the same as:

(let ((r1 0d0) (r2 0d0) (r3 0d0)
      (a1 1d0) (a2 2d0) (a3 3d0)
      (b1 4d0) (b2 5d0) (b3 6d0))
  (declare (optimize (speed 3))
           (type double-float r1 r2 r3 a1 a2 a3 b1 b2 b3))
  (time (dotimes (it 1000000 t)
          (setf r1 (+ a1 b1))
          (setf r2 (+ a2 b2))
          (setf r3 (+ a3 b3)))))

Which is exactly what I wanted. The last very minor thing is that SBCL still complains about the optimization of foo, but I can just muffle it.
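For the record, muffling can be done with SBCL's (non-portable) muffle-conditions declaration. A sketch, reusing the values version of foo:

```lisp
(defun foo (a1 a2 a3 b1 b2 b3)
  (declare (optimize (speed 3))
           ;; SBCL extension: suppress efficiency notes for this function.
           (sb-ext:muffle-conditions sb-ext:compiler-note)
           (type double-float a1 a2 a3 b1 b2 b3))
  (values (+ a1 b1) (+ a2 b2) (+ a3 b3)))
```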


Solution

  • Ok so the first thing I want to do is explain what “float to pointer coercion” means.

    (Prerequisites: an understanding of what memory is in a machine, bits, and a rough idea of the C memory model.)

    In Lisp, which is dynamically typed, it is values that have types and not variables. Therefore you need some consistent memory representation which can be used to pass any value of any type and you need to be able to determine the type for this. (Note that in some strongly typed functional languages one may still need some kind of structured memory representation for the garbage collection to follow pointers, or a uniform way to represent all pointers in a word so that polymorphism works). The typical choice for this is a pointer, which is always a word (let’s say 8 bytes), and then the object which is pointed to may have a header with more information about its type. In C it would look something like:

    struct LispObj_s {
      uint32_t length;
      uint16_t gc_data;
      uint16_t type_code;
      union {
        int64_t as_smallint;
        float   as_float;
        double  as_double;
        char    as_string[1]; /* actually LENGTH bytes */
        /* ... bignums, arrays, cons, structs, etc. */
      };
    };

    typedef struct LispObj_s *LispObj;
    

    This is bad because lots of commonly used things (namely integers) then have a huge overhead: one word for the pointer to the object, one for the header (saying "I am an integer and I am 1 word long"), and one for the number itself. That's an overhead of 200%, and it means that reading an integer requires a pointer dereference and a possible cache miss.

    So there's an optimisation: you use a few of the least significant bits of the pointer to say what the type is (these bits are free because everything is word-aligned, so the three least significant bits are always 0). Then you can say that if (for example) the last bit is 0 you have a fixnum (a small integer where arithmetic is fast, in this case 63 bits); if the last bits are 101 you have a pointer to a cons cell; if they are 111, a pointer to a double; 011, a float; and 001, a pointer to something else. Now integers, conses, and floats are cheaper, which is good, but the disadvantage is that you now need to be a bit more careful to make sure the tags are always right.

    The problem is that we can't play the same trick for doubles that we did for integers: it's not really feasible to just chop 2 bits off a double the way you chop bits off an integer, especially if you want compatibility with external programs.
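    A quick illustration of the integer side of this (the exact figures below assume a stock 64-bit SBCL; other platforms and implementations differ):

```lisp
;; One low bit is reserved for the fixnum tag on 64-bit SBCL,
;; leaving 63 bits of signed payload: most-positive-fixnum is 2^62 - 1.
(assert (= most-positive-fixnum (1- (expt 2 62))))

;; Anything larger spills over into a heap-allocated bignum,
;; which is reached through a tagged pointer.
(assert (typep (expt 2 62) 'bignum))
```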

    In compiled code you want to be operating on the objects (like floats) themselves and so keep them in their raw form in registers or the stack. Only the compiled code can access them and it knows what type they are.

    When you want to return or pass (e.g.) a float as an argument, the code receiving it doesn't necessarily know what sort of object it will get, and the code sending it certainly doesn't know that the recipient wants a certain type of data. Thus you need to convert it to some uniform form that also says what type it is, and to do this you need to allocate some memory on the heap for it and pass a tagged pointer to that memory. When calling a function you don't know what the called function is going to do with the float, and therefore you can't necessarily (see later) allocate it on the stack: the callee might e.g. put it in a global variable, and a bit later that variable would point to garbage data or unmapped memory. It's even worse when you're returning, as you'd be allocating the float in your own stack frame, which you're about to destroy. Allocation tends to be slow (although much faster than in e.g. C, mainly because you're writing to memory not in cache), and reading allocated objects tends to be slow (because you need to inspect and remove tags, dereference a pointer, and often take a cache miss), which is why sbcl complains when its optimiser has to allocate and box an object.

    One thing one can do to make allocation faster is to declare dynamic extent. This tells Lisp that you promise the object will not be pointed to from anywhere once the current dynamic extent ends, and that it may therefore be allocated on the stack. Stack allocation is faster and more cache-local, and freeing a stack allocation is basically free since the gc is not involved. You can do this when passing arguments but not when returning.
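    A minimal sketch of the declaration (the function and its shape are just illustrative):

```lisp
(defun sum3 (x)
  (declare (optimize (speed 3))
           (type double-float x))
  ;; V never escapes SUM3, so we may promise dynamic extent,
  ;; and SBCL can then allocate it on the stack instead of the heap.
  (let ((v (make-array 3 :element-type 'double-float :initial-element x)))
    (declare (dynamic-extent v))
    (+ (aref v 0) (aref v 1) (aref v 2))))
```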

    Using values is a bit like stack-allocating a returned list (in sbcl the &rest parameter of a function can also be stack-allocated if declared dynamic-extent), except that it actually is possible, which makes it a more efficient way to return multiple values. Unfortunately, due to the uniform memory representation, this gives no advantage for floats, which still have to be boxed.


    Answer

    However, if we know about the function we are calling, we can stack-allocate the space for its return values in advance and then have the function put its values there. If we do this we can also avoid the float-to-pointer coercion, in the following way: sbcl has an optimised representation for specialised float arrays where, instead of an array of pointers to floats, the floats are stored directly (e.g. a float* rather than a float**):

    (defun foo3 (a1 a2 a3 b1 b2 b3 result)
      (declare (optimize (speed 3) (safety 0))
               (type double-float a1 a2 a3 b1 b2 b3)
               (type (simple-array double-float (3)) result))
      (setf (aref result 0) (+ a1 b1)
            (aref result 1) (+ a2 b2)
            (aref result 2) (+ a3 b3))
      nil)
    

    And then we can call this in the following convoluted way:

    (let ((foo-result (make-array '(3) :element-type 'double-float :initial-element 0d0)))
      (declare (dynamic-extent foo-result))
      (foo3 a1 a2 a3 b1 b2 b3 foo-result)
      (let ((c1 (aref foo-result 0)) ...)
        ...))
    

    This way we allocate space for the results on the stack, then foo3 fills it up, then we extract the results from the stack, presumably into registers. This should be almost as fast as foo3 returning its results in registers and, crucially, doesn't need heap allocation. (Edit: I don't think the following is true: if sbcl assumes the function won't change, it can jump past the type checking/unboxing straight into the meat of the function.)

    Unfortunately the syntax is unpleasant, but there is a way round that in Lisp: macros. One could write a special macro for defining a function like foo3 and another for calling it and binding the results, but that is bad because it splits the world into two kinds of function: you can't use higher-order functions with foo, and you might end up writing macro-generating macros, which are tricky to debug. Instead the idea is this: we generate a fast version with our weird calling convention, plus a wrapper function which calls it and is declared inline. Then whenever we call the wrapper function we get the advantage of the fast call, and all the cost of the wrapper is removed by the compiler.

    (defmacro fast-defun (name num-values lambda-list &body body)
      (assert (and (integerp num-values) (plusp num-values)))
      (let ((fast-name (gensym (symbol-name name)))
            (result (gensym "RESULT"))
            (vars (loop repeat num-values collect (gensym "V"))))
        `(progn
           (defun ,fast-name ,(cons result lambda-list) ;;FIXME: this only works if lambda-list is a list of symbols
             (declare (optimize (speed 3) (safety 0))
                      (type (simple-array double-float (,num-values)) ,result))
             ;; TODO: move declarations here from body
             (multiple-value-bind ,vars (progn ,@body)
               ,@(loop for v in vars
                       for n from 0
                       collect `(setf (aref ,result ,n) ,v))
               nil))
           (declaim (inline ,name))
           (defun ,name ,lambda-list
             (let ((,result (make-array '(,num-values) :element-type 'double-float
                                        :initial-element 0d0)))
               (declare (dynamic-extent ,result) (optimize (speed 3)))
               (,fast-name ,result ,@lambda-list)
               (values ,@(loop for n below num-values
                               collect `(aref ,result ,n))))))))
    

    This macro isn't in its most general form, but you can use it as you would the (defun foo ...) above, writing (fast-defun foo 3 ...) instead.
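    Assuming the fast-defun macro above is loaded, a definition and a call site might look like this (type declarations are left out of the body because of the TODO noted in the macro):

```lisp
(fast-defun foo 3 (a1 a2 a3 b1 b2 b3)
  (values (+ a1 b1) (+ a2 b2) (+ a3 b3)))

;; The wrapper is inline, so at (speed 3) the temporary result
;; array should be stack-allocated at the call site.
(multiple-value-bind (r1 r2 r3) (foo 1d0 2d0 3d0 4d0 5d0 6d0)
  (list r1 r2 r3))
```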


    After some more experimenting, it seems that this is not actually any faster. Somehow consing is still happening and I can’t work out how to avoid it without inlining everything. I also tried making the input a dynamic extent array but that didn’t seem to help.

    After looking around a bit more, I failed to find a way to inspect intermediate results in the compilation pipeline, which made me sad, but I scanned around in the source code and I think that, although the compiler can (sometimes) specialise the calling of functions, it can't specialise the returning of float values.