Search code examples
pythonalgorithmcompressiondata-analysislossless-compression

How would I go about finding the most common substring in a file


To preface, I am attempting to create my own compression method, wherein I do not care about speed, so lots of iterations over large files is plausible. However, I am wondering if there is any method to get the most common substrings of length of 2 or more (3 most likely), as any larger would not be plausible. I am wondering if you can do this without splitting, or anything like that, no tables, just search the string. Thanks.


Solution

  • You probably want to use something like collections.Counter to associate each substring with a count, e.g.:

    >>> data = "the quick brown fox jumps over the lazy dog"
    >>> c = collections.Counter(data[i:i+2] for i in range(len(data)-2))
    >>> max(c, key=c.get)
    'th'
    >>> c = collections.Counter(data[i:i+3] for i in range(len(data)-3))
    >>> max(c, key=c.get)
    'the'