Tags: python, algorithm, compression, decode, encode

Python and compression algorithm performance


I'm interested in this compression algorithm (see the link):

https://github.com/bright-tools/varints

In particular, the problem is that the memory overhead of bytearray objects in Python makes the compression useless. Is there a solution that considers only the size of the encoding and NOT the size of the data structure? For example:

>>> import sys
>>> lst = []
>>> sys.getsizeof(lst)
64

But I would like to get something like 0 instead of 64.

How can I avoid the memory overhead? Some code, please?


Solution

  • Python is not the language you want to use if you're trying to create tiny data structures. As the README for the project you link to notes, you can use byte arrays (not lists) to reduce storage overhead if you can pack a lot of data into a single byte array. But even byte arrays (like strings) come at a cost: a 64-bit CPython install -- that is, the standard Python you will get on an x86-64 Linux install -- uses at least 33 bytes of overhead for each byte array. (I say "at least" because Python has no way of measuring memory allocation overhead.) And then there is the computational cost of deserialising the byte stream back into the original objects, if you need them.
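
    For a rough illustration (a sketch; the exact numbers depend on your Python version and platform), you can measure that per-object overhead directly and see that it is paid once per object, not once per value:

    from sys import getsizeof

    # An empty bytes object already costs roughly 33 bytes on a 64-bit CPython
    # build (exact figures vary by version); packing many values into one
    # object amortises that fixed cost.
    print(getsizeof(b""))          # ~33: pure overhead
    print(getsizeof(b"x" * 1000))  # ~1033: 33 bytes of overhead for 1000 bytes of data

    # A list of 1000 small ints pays 8 bytes per reference on top of the list
    # header, and each int object has its own overhead as well.
    print(getsizeof(list(range(1000))))  # the container alone, elements not counted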

    Since the linked page produces smaller objects, I conclude that its tests must have been done on a 32-bit Python install, probably on Windows. So that's one way you can cut down on storage usage.

    If you have Python 3.3 or later (and if you don't, just install it :-) ), then you can use the array module, which may be more convenient than a byte array, in part because you can make an array whose elements are exactly the size you need. See the Python manual for details. If you build an array.array using the b or B type codes, it will use only one byte per value. If you use h or H, you can store 16-bit integers (signed or unsigned), each one in two bytes. The overhead of an array.array is 64 bytes, just like a list, but the actual elements are much more compact.
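
    For example (a sketch; the exact base sizes vary between CPython versions), an array of unsigned bytes stores each value in a single byte, while the equivalent list needs an 8-byte reference per element plus the int objects themselves:

    from array import array
    from sys import getsizeof

    values = list(range(256))     # small integers 0..255

    packed = array('B', values)   # type code 'B': unsigned char, one byte per element
    print(getsizeof(packed))      # roughly 64 + 256 bytes

    unpacked = list(values)
    print(getsizeof(unpacked))    # roughly 64 + 256 * 8 bytes, and the int objects are extra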

    Personally, I wouldn't bother with stuff like this, but I suppose it has its uses. In fact, the referenced README page underestimates the storage consumption of a Python list of integers, because it doesn't take into account the size of the integer objects themselves, which is considerable.

    The size of a list revealed by sys.getsizeof is only the size of the list itself. It does not include the objects in the list, only a reference to each object (eight bytes per reference on a standard Linux Python install). It does include the memory used by the list's object header, which is 64 bytes on the same standard Python install. (That's the 64 bytes shown in your test.)
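
    If you want a figure that also counts the elements, you have to add them up yourself, for example with a naive helper like this (a hypothetical sketch, not a library function; it ignores shared objects and nested containers):

    from sys import getsizeof

    def total_size(lst):
        """Size of the list object itself plus its top-level elements."""
        return getsizeof(lst) + sum(getsizeof(item) for item in lst)

    nums = [10, 20, 30, 40]
    print(getsizeof(nums))    # the container only: 64 bytes plus 8 per reference here
    print(total_size(nums))   # container plus the four int objects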

    Finally, it might include some extra space at the end, in order to allow appending items to the list without having to reallocate and copy it. The amount of such extra space depends on a lot of factors, including the precise way the list was constructed, but it appears that you can reduce this particular overhead to zero by copying the list with a slice: a[:].
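
    A quick way to see that effect (a sketch; the exact growth pattern depends on the CPython version):

    from sys import getsizeof

    a = []
    for i in range(9):
        a.append(i)          # grown by repeated appends, so it carries spare slots

    print(getsizeof(a))      # includes the over-allocated slots
    print(getsizeof(a[:]))   # a slice copy is sized to fit, so it is usually smaller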

    In Python, integers are full-blown objects, and they use a surprising amount of space. Or maybe it is not so surprising when you consider that Python integers are bignums, so they have no artificial size limit. According to getsizeof, integers whose absolute value is less than 2^30 occupy 28 bytes, and each additional 30 bits (or part thereof) cost another four bytes. (In fact, you could bit-pack a large vector of small integers into a single bignum, taking advantage of the fact that left- and right-shift operations are reasonably fast, and thereby shave a few more bytes off of each list. But array.array is almost certainly easier.)
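
    A minimal sketch of that bit-packing idea, assuming every value is a non-negative integer that fits in a fixed number of bits (8 here); pack and unpack are just illustrative helpers, not part of any library:

    from sys import getsizeof

    def pack(values, bits=8):
        """Pack a sequence of small non-negative ints into a single big integer."""
        packed = 0
        for v in values:
            packed = (packed << bits) | v
        return packed

    def unpack(packed, count, bits=8):
        """Recover the original sequence, in the original order."""
        mask = (1 << bits) - 1
        out = []
        for _ in range(count):
            out.append(packed & mask)
            packed >>= bits
        out.reverse()
        return out

    values = [3, 141, 59, 26, 5]
    packed = pack(values)
    assert unpack(packed, len(values)) == values

    print(getsizeof(values))  # the list plus one 8-byte reference per element (ints not included)
    print(getsizeof(packed))  # one bignum: 28 bytes plus roughly 4 bytes per 30 bits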


    Some experiments with getsizeof, for reference:

    >>> from sys import getsizeof
    >>> # Strings occupy 48 bytes plus the length of the string plus one byte (presumably for a NUL)
    >>> getsizeof("")   # 48 + 0 + 1
    49
    >>> getsizeof("a")  # 48 + 1 + 1
    50
    >>> getsizeof("abcdefghijklmnopqrstuvwxyz") # 48 + 26 + 1
    75
    >>> getsizeof("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ") # 48 + 52 + 1
    101
    >>> # But that's not counted in the size of a list; all the lists below are the same size:
    >>> getsizeof([""])
    72
    >>> getsizeof(["a"])
    72
    >>> getsizeof(["abcdefghijklmnopqrstuvwxyz"])
    72
    >>> getsizeof(["abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"])
    72
    >>> # Same for a list containing a single number
    >>> getsizeof([0])
    72
    >>> # Lists need 64 bytes plus 8 bytes per element (a pointer to the element):
    >>> getsizeof([0,1])
    80
    >>> getsizeof([0,1,2])
    88
    >>> getsizeof([0,1,2,3])
    96
    >>> # When you append to a list, Python leaves some extra space for the next appends
    >>> a = [0,1,2,3]
    >>> getsizeof(a)
    96
    >>> # As above, 64 + 4 * 8 bytes. But when we add a single element,
    >>> # we get enough room for four elements, so the next three appends
    >>> # don't require more space:
    >>> a.append(4)
    >>> getsizeof(a)
    128                 
    >>> a.append(5)
    >>> getsizeof(a)
    128
    >>> a.append(6)
    >>> getsizeof(a)
    128
    >>> a.append(7)
    >>> getsizeof(a)
    128
    >>> # When we append the 9th element, we get room for another eight
    >>> a.append(8)
    >>> getsizeof(a)
    192
    

    You could also save a few bytes by using tuples instead of lists: a tuple, like a byte array, is immutable, but if you can live with not being able to modify the object, a tuple will save you 16 bytes compared with the equivalent list:

    >>> getsizeof( (1,2,3) )
    72
    >>> getsizeof( [1,2,3] )
    88