python algorithm compression data-analysis lossless-compression

How would I go about finding the most common substring in a file?

To preface, I am attempting to create my own compression method. I do not care about speed, so many passes over large files are acceptable. Is there any way to get the most common substrings of length 2 or more (most likely 3, as anything longer would not be practical)? Ideally I would like to do this without splitting the data or building tables, just by searching the string. Thanks.

Solution

You probably want to use something like collections.Counter to associate each substring with a count, e.g.:

>>> import collections
>>> data = "the quick brown fox jumps over the lazy dog"
>>> # a string of length L has L - n + 1 substrings of length n
>>> c = collections.Counter(data[i:i+2] for i in range(len(data) - 1))
>>> max(c, key=c.get)
'th'
>>> c = collections.Counter(data[i:i+3] for i in range(len(data) - 2))
>>> max(c, key=c.get)
'the'
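
If you want more than the single winner, or want to try several lengths, the same sliding-window count can be wrapped in a small helper. This is only a sketch: the function name most_common_substrings and its parameters n and k are made up for illustration, and it assumes the whole input fits in memory as one str or bytes object.

```python
import collections

def most_common_substrings(data, n, k=1):
    """Return up to k (substring, count) pairs for the most frequent
    length-n substrings of data (hypothetical helper, not a library API)."""
    # Slide a window of width n over data; there are len(data) - n + 1
    # windows, and none at all when data is shorter than n.
    counts = collections.Counter(
        data[i:i + n] for i in range(len(data) - n + 1)
    )
    # Counter.most_common sorts by count, highest first.
    return counts.most_common(k)

data = "the quick brown fox jumps over the lazy dog"
print(most_common_substrings(data, 3, k=2))
```

Slicing works the same way on bytes, so for a real file you could read it in binary mode and pass the result in. Note that overlapping occurrences are counted separately (b"aaaa" contains b"aa" three times), which may overstate how many replacements a compressor could actually make.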
