Search code examples
pythonhashchecksum

Python checksum calculation ideas


I have a table in an excel sheet containing strings, with let's say 3 columns and 2 rows. E.g.:

  • "one" , "two", "three"
  • "four", "five", "six"

I want to come up with an algorithm that can calculate a checksum for the whole table, which would be different if a new row is added (obvious) or even if one string switch places with another one from another row (less obvious).

So for the following table, we would have a different checksum than the previous one:

  • "one" , "five", "three"
  • "four", "two", "six"

I know there are a lot of possible ways to accomplish this, but is there a clean & robust way to do it in python? I had thought about using the position of the string in the table as a factor in the calculation, but I don't know whether it's a good idea or not as checksum calculation is one of my weaker skills.

I feel I should mention that in that same excel file, I have other data that can change without modifying the checksum though. The checksum should only verify the above-mentioned table's data, and must be a numeric value (because I need to send it over a CAN bus afterwards).

Does anyone have any suggestions as to how I could accomplish this? Please be precise and give examples.


EDIT: So I came up with an algorithm but yet completely unsure about its robustness. My idea was to first calculate a checksum for each cell of the first column and multiply it by the row index, then do the same for the seconds column and the third (if we use the same table as in the description). Then finally, sum the resulting checksums by multiplying each one of them by their column index. So something like:

total_checksum = (checksum('one') * 1 + checksum('four') * 2) * 1 + (checksum('two') * 1 + checksum('five') * 2) * 2 + (checksum('three') * 1 + checksum('six') * 2) * 3

I willingly posted this calculation like this without any fancy code just to explain my point. Now for the code (not pretty at the moment sorry for this), we can assume that the table contents have been parsed into a list of dictionary with key=column and item=value. For the table from the example we have:

tab = [{"A": "one", "B": "two", "C": "three"},
       {"A": "four", "B": "five", "C": "six"}]

print str(calculate_checksum(tab))


def calculate_checksum(table):
    """Calculate checksum of a dictionary (represents the excel table)"""

    def checksum(st):
        """Convert a string into tables of ascii characters and reduce it"""
        return reduce(lambda x, y: x + y, map(ord, st))

    total_cks = column_count = 0

    # list the data of each column:
    for column in ['A', 'B', 'C']:
        data_list = [data[column] for data in tables]
        chk = 0
        for row_index in range(len(data_list)):
            chk += checksum(data_list[row_index]) * (row_index + 1) # multiply each cell's checksum by its row index

        column_count += 1
        total_cks += chk * column_count # multiply each column checksum by its column index

    return total_cks

With this algorithm, the original table's checksum is 7254.When I switch "two" and "five", I get 7094 so it works. When I switch "one" and "two" I get 7230 that's covered too.


Solution

  • With the help of Prahlad Yeri, I've managed to achieve this. I've found out that what I needed was in the zlib library.

    I add up all cell contents with ":" between each string (any other character considered safe can be used) and use the adler32 function of the zlib library to calculate a 32 bits checksum out of it.

    import zlib
    
    tab = [{"A": "one", "B": "two", "C": "three"},
       {"A": "four", "B": "five", "C": "six"}]
    
    print str(calculate_checksum(tab))
    
    
    def calculate_checksum(table):
        """Returns a 32bits checksum out of the concatenated cell contents, using Adler algorithm."""
        data = ""
        for row in table:
            for column in ["A", "B", "C"]:
                data += row[column] + ":" # Use ":" to separate each string
    
        return zlib.adler32(data)