Tags: python, arrays, numpy, readability

efficiency vs. readability: obfuscation when using nested boolean index arrays


I have some pretty ugly indexing going on. For example, things like

valid[ data[ index[valid[:,0],0] ] == 0, 1] = False

where valid and index are Nx2 arrays of bools and ints respectively, and data is N long.

If I concentrate really hard, I can convince myself that this is doing what I want... but it's incredibly obfuscated. How can I unobfuscate something like this efficiently?

I could break it up, for example:

valid_index = index[valid[:,0],0]
invalid_index = (data[ valid_index ] == 0)
valid[ invalid_index, 1 ] = False

But my arrays will have up to hundreds of millions of entries, so I don't want to duplicate memory, and I need to remain as speed-efficient as possible.
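
For what it's worth, the named intermediates in the broken-up version are the same temporaries NumPy already builds internally to evaluate the one-liner, so binding names to them doesn't copy the big arrays. A small sketch with made-up toy shapes (values and names are illustrative, not from the question):

```python
import numpy as np

# Toy stand-ins for the real arrays (shapes and values are made up).
N = 8
rng = np.random.default_rng(0)
valid = rng.random((N, 2)) < 0.5           # Nx2 bools
index = rng.integers(0, N, size=(N, 2))    # Nx2 ints
data = rng.integers(0, 3, size=N)          # length-N data

# NumPy allocates these same temporaries internally when evaluating
# the one-liner; naming them adds no extra copies of the big arrays.
valid_index = index[valid[:, 0], 0]        # one int per selected row
invalid_index = (data[valid_index] == 0)   # same length, bools

# Each temporary has one entry per selected row, not a copy of data.
assert valid_index.shape[0] == int(valid[:, 0].sum())
assert invalid_index.shape == valid_index.shape
```

The only cost of the named version is that the temporaries stay alive until rebound or out of scope, which matters far less than duplicating the source arrays would.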


Solution

  • These two code sequences are nearly identical, and should have very similar performance. That was my gut feeling, but I then did static analysis and ran a partial benchmark to confirm it.

    The clearer option requires four more bytecodes to implement, so will probably be slightly slower. But the extra work is restricted to LOAD_FAST and STORE_FAST, which are just moves from the top of stack (TOS) to/from variables. As the extra work is modest, so should be the performance impact.

    You could benchmark the two approaches on your target equipment for more quantitative precision, but on my 3-year-old laptop, 100 million extra LOAD_FAST / STORE_FAST pairs take just over 3 seconds on standard CPython 2.7.5. So I estimate this clarity will cost you about 6 seconds per 100M entries. While the PyPy just-in-time Python compiler doesn't use the same bytecodes, I timed its overhead for the clear version at about half that, or 3 seconds per 100M. Compared to the other work you're doing to process the items, the clearer version probably is not a significant slowdown.

    The TL;DR Backstory

    My first impression was that the two code sequences, while different in readability and clarity, are technically very similar, and should have very similar performance characteristics. But let's analyze a bit further using the Python disassembler. I dropped each code snippet into a function:

    def one(valid, data):
        valid[ data[ index[valid[:,0],0] ] == 0, 1] = False
    
    def two(valid, data):
        valid_index = index[valid[:,0],0]
        invalid_index = (data[ valid_index ] == 0)
        valid[ invalid_index, 1 ] = False
    

    Then, using Python's bytecode disassembler:

    import dis
    dis.dis(one)
    print "---"
    dis.dis(two)
    

    Gives:

    15           0 LOAD_GLOBAL              0 (False)
                 3 LOAD_FAST                0 (valid)
                 6 LOAD_FAST                1 (data)
                 9 LOAD_GLOBAL              1 (index)
                12 LOAD_FAST                0 (valid)
                15 LOAD_CONST               0 (None)
                18 LOAD_CONST               0 (None)
                21 BUILD_SLICE              2
                24 LOAD_CONST               1 (0)
                27 BUILD_TUPLE              2
                30 BINARY_SUBSCR       
                31 LOAD_CONST               1 (0)
                34 BUILD_TUPLE              2
                37 BINARY_SUBSCR       
                38 BINARY_SUBSCR       
                39 LOAD_CONST               1 (0)
                42 COMPARE_OP               2 (==)
                45 LOAD_CONST               2 (1)
                48 BUILD_TUPLE              2
                51 STORE_SUBSCR        
                52 LOAD_CONST               0 (None)
                55 RETURN_VALUE        
    

    18           0 LOAD_GLOBAL              0 (index)
                 3 LOAD_FAST                0 (valid)
                 6 LOAD_CONST               0 (None)
                 9 LOAD_CONST               0 (None)
                12 BUILD_SLICE              2
                15 LOAD_CONST               1 (0)
                18 BUILD_TUPLE              2
                21 BINARY_SUBSCR       
                22 LOAD_CONST               1 (0)
                25 BUILD_TUPLE              2
                28 BINARY_SUBSCR       
                29 STORE_FAST               2 (valid_index)
    
    19          32 LOAD_FAST                1 (data)
                35 LOAD_FAST                2 (valid_index)
                38 BINARY_SUBSCR       
                39 LOAD_CONST               1 (0)
                42 COMPARE_OP               2 (==)
                45 STORE_FAST               3 (invalid_index)
    
    20          48 LOAD_GLOBAL              1 (False)
                51 LOAD_FAST                0 (valid)
                54 LOAD_FAST                3 (invalid_index)
                57 LOAD_CONST               2 (1)
                60 BUILD_TUPLE              2
                63 STORE_SUBSCR        
                64 LOAD_CONST               0 (None)
                67 RETURN_VALUE        
    

    Similar but not identical, and not in the same order. A quick diff of the two shows the same, and suggests the clearer function requires a few more bytecodes.

    I parsed the bytecode opcodes out of each function's disassembler listing, dropped them into a collections.Counter, and compared the counts:

    Bytecode       Count(s) 
    ========       ======== 
    BINARY_SUBSCR  3        
    BUILD_SLICE    1        
    BUILD_TUPLE    3        
    COMPARE_OP     1        
    LOAD_CONST     7        
    LOAD_FAST      3, 5     *** differs ***
    LOAD_GLOBAL    2        
    RETURN_VALUE   1        
    STORE_FAST     0, 2     *** differs ***
    STORE_SUBSCR   1   
    

    Here it becomes evident that the second, clearer approach uses only four more bytecodes, all of the simple, fast LOAD_FAST / STORE_FAST variety. Static analysis thus shows no particular reason to fear additional memory allocation or other performance-killing side effects.
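
    One way to reproduce that tally in code, rather than by parsing the printed listing: a sketch using `dis.get_instructions` (added in Python 3.4, so not available under the 2.7 used above; `index` is passed as a parameter here purely so the snippet is self-contained, which changes its LOAD_GLOBAL to a LOAD_FAST but doesn't affect the comparison):

```python
import dis
from collections import Counter

def one(valid, data, index):
    valid[data[index[valid[:, 0], 0]] == 0, 1] = False

def two(valid, data, index):
    valid_index = index[valid[:, 0], 0]
    invalid_index = (data[valid_index] == 0)
    valid[invalid_index, 1] = False

def opcode_counts(func):
    # Tally how often each opcode name appears in the compiled bytecode.
    return Counter(ins.opname for ins in dis.get_instructions(func))

c1, c2 = opcode_counts(one), opcode_counts(two)
differs = {op: (c1[op], c2[op])
           for op in sorted(c1.keys() | c2.keys())
           if c1[op] != c2[op]}
print(differs)  # the clearer version shows extra local stores/loads
```

    Exact opcode names vary by interpreter version (newer CPythons fuse some adjacent LOAD_FAST / STORE_FAST instructions), but the pattern is the same: only the cheap local-variable moves differ.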

    I then constructed two functions, very similar to one another, that the disassembler shows differ only in that the second one has an extra LOAD_FAST / STORE_FAST pair. I ran them 100,000,000 times, and compared their runtimes. They differed by just over 3 seconds in CPython 2.7.5, and about 1.5 seconds under PyPy 2.2.1 (based on Python 2.7.3). Even when you double those times (because you have two pairs), it's pretty clear those extra load/store pairs are not going to slow you down much.
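
    A minimal version of that experiment, for anyone who wants to repeat it: the function names and the scaled-down iteration count below are my own (the original ran 100,000,000 iterations under CPython 2.7.5), and the two functions differ only in routing the result through a local name, which on classic CPython adds one LOAD_FAST / STORE_FAST pair:

```python
import timeit

def direct(x):
    return x + 1

def with_temp(x):
    # Same computation, bound to a local first: on classic CPython
    # this adds one extra LOAD_FAST / STORE_FAST pair.
    y = x + 1
    return y

n = 1_000_000   # scaled down from the answer's 100,000,000
t_direct = timeit.timeit(lambda: direct(1), number=n)
t_temp = timeit.timeit(lambda: with_temp(1), number=n)
print("direct: %.3fs  with_temp: %.3fs" % (t_direct, t_temp))
```

    The absolute numbers depend on your hardware and interpreter; the point is that the gap per call is a handful of nanoseconds.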