I have some text transformation code (a DSL engine translating DSL code into native code) and am attempting to improve error diagnosis by mapping line numbers in exceptions back to the line in the DSL source that originated them, rather than the generated source.
To this end, during conversion I track the originating line number for each transformed line, or the triggering line number for each generated chunk of native code. During development I emit these line numbers as a plain list, which makes them easy for humans to utilize: __mapping__[line]
immediately retrieves the original line number. In production environments, however, keeping around so many (potentially large) lists for literally only exceptional events is sub-optimal. Instead, I would like to encode this list in some sensible way, with no regard for encoding difficulty (there is all the time in the world when generating the code) and a preference for simple, fast decoding.
Experimenting, I have come up with:
from base64 import b64encode
from zlib import compress

def redelta_encode(numbers):
    """Encode a series of line numbers as the difference from line to line (deltas) to reduce entropy.

    The delta is stored as a single signed byte per line, meaning no two consecutive lines may vary by
    more than +/- 127 lines within the original source material. The resulting bytestring is then zlib
    compressed and b64 encoded. Lines without line numbers (None or unexpected zeros) will inherit the
    last known line number after decoding.
    """
    def inner():
        lines = iter(numbers)
        prev = next(lines)
        for line in lines:
            delta = (line or prev) - prev  # Handle the "no line number given" case.
            if delta < 0: delta = -1 * delta + 127  # Store "signed" values.
            prev = (line or prev)  # Track our line number, or the last known good one.
            yield delta  # Store the delta.
    return b64encode(compress(bytes(bytearray(inner())))).decode('latin1')
In testing, this actually appears to be a pathological case in zlib:
In [1]: redelta_encode(list(range(10000)))
Out[1]: 'eJztwQEJAAAAwyDWv/RzHNQCAAAAAAAAAACAewMwvCcQ'
The larger the range, the more A's are present, making the resulting compressed content ironically highly compressible. Is there an algorithmically optimal, simply better, or preferred way to store this type of generally uniform or monotonous list of integers? Obviously in the perfect case I store nothing: no number translation is needed, but most DSLs will transform or generate lines, adding irregularity to the overall pattern.
As a note, this is for the FOSS marrow/dsl (DSL engine) and cinje (template engine using it) projects. Thanks in advance!
Edited a year after being closed as inactive: I've finally figured out the proper name for what I'm looking for; this is a form of run-length encoding (RLE) tuned for encoding range()s. I think the pre-processing technique (transforming actual line numbers into the increment amount from line to line) would be helpful for this. I've prepared an example translation of the cinje.std.html module, demonstrating the line number mapping on line 231 of 02-generated.py.
Indeed, run-length encoding appears to be the solution to my problem. Given a __mapping__ like the following (from my example translation):
__mapping__ = [0,2,3,4,5,6,6,6,6,6,6,6,6,6,7,8,9,10,11,12,12,12,12,13,14,15,16,17,19,20,21,22,23,23,23,24,25,25,25,28,29,29,29,31,32,32,32,34,35,35,35,35,38,39,40,40,40,40,41,42,43,44,44,44,46,47,47,47,47,50,51,52,52,52,52,53,54,55,56,58,59,60,60,60,61,62,63,65,66,67,67,67,68,68,68,68,68,68,69,70,71,73,74,74,76,77,78,79,79,79,79,80,81,81,81,81,81,82,82,82,82,82,82,83,83,85,86,87,87,87,87,88,89,89,89,90,90,90,90,90,90,91,91,93,94,95,95,95,95,96,97,97,97,98,99,100,101,103,103,105,106,107,107,107,107,108,109,109,109,109,110,111,112,113,115,115,115,115,117,118,119,119,119,119,120,121,121,121,121,121,121,123,124,125,125,125,125,126,127,128,129,130,131,132,133,134,137,138,138,138,138,139,140,141,141,141,142,142,142,144,145,146,146,146,149,149,149,149,151,151,151]
Transformed into the intra-element deltas, the shape of the repeating pattern should become immediately apparent:
deltas = [2,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,1,1,1,1,1,2,1,1,1,1,0,0,1,1,0,0,3,1,0,0,2,1,0,0,2,1,0,0,0,3,1,1,0,0,0,1,1,1,1,0,0,2,1,0,0,0,3,1,1,0,0,0,1,1,1,1,2,1,1,0,0,1,1,1,2,1,1,0,0,1,0,0,0,0,0,1,1,1,2,1,0,2,1,1,1,0,0,0,1,1,0,0,0,0,1,0,0,0,0,0,1,0,2,1,1,0,0,0,1,1,0,0,1,0,0,0,0,0,1,0,2,1,1,0,0,0,1,1,0,0,1,1,1,1,2,0,2,1,1,0,0,0,1,1,0,0,0,1,1,1,1,2,0,0,0,2,1,1,0,0,0,1,1,0,0,0,0,0,2,1,1,0,0,0,1,1,1,1,1,1,1,1,1,3,1,0,0,0,1,1,1,0,0,1,0,0,2,1,1,0,0,3,0,0,0,2,0,0]
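These deltas are nothing more than pairwise differences; when every entry carries a real line number (no zero/"missing" entries to inherit), a one-liner reproduces them (my sketch, abbreviated data):

```python
# Pairwise differences between consecutive line numbers; equivalent to the
# delta transform when no entry is missing its line number.
mapping = [0, 2, 3, 4, 5, 6, 6, 6]  # abbreviated head of __mapping__
deltas = [b - a for a, b in zip(mapping, mapping[1:])]
print(deltas)  # → [2, 1, 1, 1, 1, 0, 0]
```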
Utilizing the following (ridiculously simple, upon consideration) groupby expression to generate the RLE form (adapted from this SO answer):
>>> rle = [(k, sum(1 for i in g)) for k,g in groupby(deltas)]; rle
[(2, 1), (1, 4), (0, 8), (1, 6), (0, 3), (1, 5), (2, 1), (1, 4), (0, 2), (1, 2), (0, 2), (3, 1), (1, 1), (0, 2), (2, 1), (1, 1), (0, 2), (2, 1), (1, 1), (0, 3), (3, 1), (1, 2), (0, 3), (1, 4), (0, 2), (2, 1), (1, 1), (0, 3), (3, 1), (1, 2), (0, 3), (1, 4), (2, 1), (1, 2), (0, 2), (1, 3), (2, 1), (1, 2), (0, 2), (1, 1), (0, 5), (1, 3), (2, 1), (1, 1), (0, 1), (2, 1), (1, 3), (0, 3), (1, 2), (0, 4), (1, 1), (0, 5), (1, 1), (0, 1), (2, 1), (1, 2), (0, 3), (1, 2), (0, 2), (1, 1), (0, 5), (1, 1), (0, 1), (2, 1), (1, 2), (0, 3), (1, 2), (0, 2), (1, 4), (2, 1), (0, 1), (2, 1), (1, 2), (0, 3), (1, 2), (0, 3), (1, 4), (2, 1), (0, 3), (2, 1), (1, 2), (0, 3), (1, 2), (0, 5), (2, 1), (1, 2), (0, 3), (1, 9), (3, 1), (1, 1), (0, 3), (1, 3), (0, 2), (1, 1), (0, 2), (2, 1), (1, 2), (0, 2), (3, 1), (0, 3), (2, 1), (0, 2)]
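A quick sanity check (my addition, with abbreviated data): expanding the runs back out must recover the delta list exactly:

```python
from itertools import chain, groupby, repeat

deltas = [2, 1, 1, 1, 1, 0, 0, 0, 1, 3]  # abbreviated sample
rle = [(k, sum(1 for _ in g)) for k, g in groupby(deltas)]
# Re-expand each (value, count) run and compare against the input.
expanded = list(chain.from_iterable(repeat(k, n) for k, n in rle))
assert expanded == deltas
print(rle)  # → [(2, 1), (1, 4), (0, 3), (1, 1), (3, 1)]
```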
Getting some quick-and-dirty stats off of these structures, even before string serialization:
>>> sys.getsizeof(__mapping__)
1912
>>> sys.getsizeof(deltas)
2072
>>> sys.getsizeof(rle)
912
At 47% of the original size, that is not an unreasonable saving. To decode ("decompress") without requiring total decompression just to pull out a single value, iterate over the runs, counting lines as you progress and stopping when you've reached your target generated line number:
from itertools import chain, groupby, repeat

__mapping__ = [0, 2, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 8, 9, 10, 11, 12, 12, 12, 12, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 23, 23, 24, 25, 25, 25, 28, 29, 29, 29, 31, 32, 32, 32, 34, 35, 35, 35, 35, 38, 39, 40, 40, 40, 40, 41, 42, 43, 44, 44, 44, 46, 47, 47, 47, 47, 50, 51, 52, 52, 52, 52, 53, 54, 55, 56, 58, 59, 60, 60, 60, 61, 62, 63, 65, 66, 67, 67, 67, 68, 68, 68, 68, 68, 68, 69, 70, 71, 73, 74, 74, 76, 77, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 82, 82, 82, 82, 82, 82, 83, 83, 85, 86, 87, 87, 87, 87, 88, 89, 89, 89, 90, 90, 90, 90, 90, 90, 91, 91, 93, 94, 95, 95, 95, 95, 96, 97, 97, 97, 98, 99, 100, 101, 103, 103, 105, 106, 107, 107, 107, 107, 108, 109, 109, 109, 109, 110, 111, 112, 113, 115, 115, 115, 115, 117, 118, 119, 119, 119, 119, 120, 121, 121, 121, 121, 121, 121, 123, 124, 125, 125, 125, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 137, 138, 138, 138, 138, 139, 140, 141, 141, 141, 142, 142, 142, 144, 145, 146, 146, 146, 149, 149, 149, 149, 151, 151, 151]

def delta_encode(numbers):
    lines = iter(numbers)
    prev = next(lines)
    for line in lines:
        delta = (line or prev) - prev
        yield delta
        prev = (line or prev)

def delta_decode(encoded):
    line = 0
    for n in chain.from_iterable(repeat(k, n) for k, n in encoded):
        yield line
        line += n
    yield line
rle = [(k, sum(1 for i in g)) for k, g in groupby(delta_encode(__mapping__))]
def line_for(target, rle):
    # The array of line numbers is zero-based, but humans don't think about code that way.
    target -= 1
    for line, result in enumerate(delta_decode(rle)):
        if line == target:
            return __mapping__[target] + 1, result + 1
print(line_for(1, rle))
print(line_for(20, rle))
print(line_for(200, rle))
Resulting in correct answers (source-mapped result on the left, RLE-decompressed result on the right):
(1, 1)
(13, 13)
(129, 129)
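As a further refinement for production builds (my sketch, not part of the code above), the RLE pairs could be packed two bytes per run instead of kept as a Python list of tuples, assuming each delta and run length fits in a single unsigned byte:

```python
from struct import pack, unpack_from

def pack_rle(rle):
    # One unsigned byte per delta, one per run length (both assumed < 256).
    return b''.join(pack('BB', k, n) for k, n in rle)

def unpack_rle(blob):
    # Read the (delta, count) pairs back out, two bytes at a time.
    return [unpack_from('BB', blob, i) for i in range(0, len(blob), 2)]

runs = [(2, 1), (1, 4), (0, 8), (1, 6)]
packed = pack_rle(runs)
assert len(packed) == 2 * len(runs)
assert unpack_rle(packed) == runs
```

Runs longer than 255 would need splitting into multiple pairs, and negative deltas would need the same "offset by 127" trick the byte encoder above already uses.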