Tags: smalltalk, pharo, squeak

Are Smalltalk bytecode optimizations worth the effort?


Consider the following method in the Juicer class:

Juicer >> juiceOf: aString
    | fruit juice |
    fruit := self gather: aString.
    juice := self extractJuiceFrom: fruit.
    ^juice withoutSeeds

The compiler generates the following bytecodes:

25 self                     ; 1
26 pushTemp: 0              ; 2
27 send: gather:
28 popIntoTemp: 1           ; 3
29 self                     ; 4
30 pushTemp: 1              ; 5
31 send: extractJuiceFrom:
32 popIntoTemp: 2           ; 6 <-
33 pushTemp: 2              ; 7 <-
34 send: withoutSeeds
35 returnTop

Now note that 32 and 33 collapse into a single store:

25 self                     ; 1
26 pushTemp: 0              ; 2
27 send: gather:
28 popIntoTemp: 1           ; 3 *
29 self                     ; 4 *
30 pushTemp: 1              ; 5 *
31 send: extractJuiceFrom:
32 storeIntoTemp: 2         ; 6 <-
33 send: withoutSeeds
34 returnTop

Next consider 28, 29 and 30. They insert self below the result of gather. The same stack configuration could have been achieved by pushing self before sending the first message:

25 self                     ; 1 <-
26 self                     ; 2
27 pushTemp: 0              ; 3
28 send: gather:
29 popIntoTemp: 1           ; 4 <-
30 pushTemp: 1              ; 5 <-
31 send: extractJuiceFrom:
32 storeIntoTemp: 2         ; 6
33 send: withoutSeeds
34 returnTop

Now collapse 29 and 30 in the same way:

25 self                     ; 1
26 self                     ; 2
27 pushTemp: 0              ; 3
28 send: gather:
29 storeIntoTemp: 1         ; 4 <-
30 send: extractJuiceFrom:
31 storeIntoTemp: 2         ; 5
32 send: withoutSeeds
33 returnTop

Temporaries 1 and 2 are written but never read. So, except when debugging, the stores into them could be skipped, leading to:

25 self                     ; 1
26 self                     ; 2
27 pushTemp: 0              ; 3
28 send: gather:
29 send: extractJuiceFrom:
30 send: withoutSeeds
31 returnTop
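This last step is a dead-store elimination: a store to a temp that is never pushed afterwards can simply be dropped. A minimal sketch in the same hypothetical tuple notation (again, not the actual compiler):

```python
def drop_dead_stores(bytecodes):
    """Remove 'storeIntoTemp: n' when temp n is never read, i.e. no
    'pushTemp: n' occurs anywhere in the method."""
    read = {arg for op, arg in bytecodes if op == 'pushTemp'}
    return [(op, arg) for op, arg in bytecodes
            if not (op == 'storeIntoTemp' and arg not in read)]

# Bytecodes after the previous rewrites: temps 1 and 2 are stored but
# never read (only temp 0, the argument, is pushed).
optimized = [('self', None), ('self', None), ('pushTemp', 0),
             ('send', 'gather:'), ('storeIntoTemp', 1),
             ('send', 'extractJuiceFrom:'), ('storeIntoTemp', 2),
             ('send', 'withoutSeeds'), ('returnTop', None)]
```

This is precisely the destructive part: after the pass, the debugger can no longer show meaningful values for `fruit` and `juice`.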

This last version, which saves 4 out of 7 stack operations, corresponds to the less expressive and less clear source:

Juicer >> juiceOf: aString
    ^(self extractJuiceFrom: (self gather: aString)) withoutSeeds

Note also that there are other possible optimizations that Pharo (I haven't checked Squeak) does not implement (e.g., jump chaining). These optimizations would encourage the Smalltalk programmer to better express their intentions without having to pay the cost of additional computations.
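Jump chaining, for instance, retargets a jump whose destination is itself an unconditional jump, skipping the intermediate hop. A small sketch of the idea (Python again, with a hypothetical encoding that uses absolute instruction indices as jump targets):

```python
def collapse_jump_chains(code):
    """Retarget 'jump'/'jumpFalse' instructions through chains of
    unconditional jumps to their final destination."""
    def final_target(i):
        seen = set()  # guards against jump cycles
        while i < len(code) and code[i][0] == 'jump' and i not in seen:
            seen.add(i)
            i = code[i][1]
        return i
    return [(op, final_target(arg)) if op in ('jump', 'jumpFalse') else (op, arg)
            for op, arg in code]

code = [('jumpFalse', 2),     # 0: lands on a jump...
        ('returnTop', None),  # 1
        ('jump', 3),          # 2: ...which lands on another jump
        ('jump', 4),          # 3
        ('returnTop', None)]  # 4
```

After the pass, the `jumpFalse` at index 0 points directly at index 4 instead of bouncing through two intermediate jumps.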

My question is whether these improvements are an illusion or not. Concretely, are bytecode optimizations absent from Pharo/Squeak because they are known to have little relevance, or are they regarded as beneficial but haven't been addressed yet?

EDIT

An interesting advantage of using a register+stack architecture [cf. A Smalltalk Virtual Machine Architectural Model by Allen Wirfs-Brock and Pat Caudill] is that the additional space provided by registers makes the manipulation of bytecodes for the sake of optimization easier. Of course, even though these kinds of optimizations are not as relevant as method inlining or polymorphic inline caches, as pointed out in the answer below, they shouldn't be disregarded, especially when combined with others implemented by the JIT compiler. Another interesting topic to analyze is whether destructive optimization (i.e., optimization that requires de-optimization to support the debugger) is actually necessary, or whether sufficient performance gains can be attained by non-destructive techniques.


Solution

  • The main annoyance when you start playing with such optimizations is the debugger interface.

    Historically, and still currently in Squeak, the debugger simulates the bytecode level and needs to map each bytecode back to the corresponding Smalltalk instruction.

    So I think the gain was considered too low to justify the added complexity, or, even worse, a degradation of the debugging facilities.

    Pharo wants to change the debugger to operate at a higher level (the Abstract Syntax Tree), but I don't know how they will map that back to the bytecode, which is all the VM knows about.

    IMO, this kind of optimization might better be implemented in the JIT compiler, which transforms bytecode into native machine code.

    EDIT

    The greatest gains are obtained by eliminating the sends themselves (through inlining), because they are much more expensive (about 10x) than the stack operations: roughly 10 times more bytecodes than sends are executed per second when you run `1 tinyBenchmarks` (on the COG VM).

    Interestingly, such optimizations could take place in the Smalltalk image, but only on hotspots detected by the VM, as in the SISTA effort. See for example https://clementbera.wordpress.com/2014/01/22/the-sista-chronicles-iii-an-intermediate-representation-for-optimizations/

    So, in the light of SISTA, the answer is rather: interesting, not yet addressed, but actively studied (and a work in progress)!

    All the machinery for de-optimizing when a method has to be debugged is still one of the difficult points, as I understand it.