Frequency distribution in a list of strings based on common beginning substring

Is there a way to get the common substrings that form the beginning of a list of strings? The length of Substring must be greater than 1.

For instance, if there is a list L:

[ "hello", "help", "helium", "habit", "habitat", "hazard"]

The output should be a frequency distribution should be:

he - 3
hel - 3
habit - 2
ha - 3

I can later remove the duplicates formed by the same strings by again applying the same logic to look at the common beginning substrings from the distribution, checking their counts, and if the counts are same, consider only the longest string.

i.e., We get 'he' and 'hel' formed by the same set of words, having the same counts. Hence, picking only "hel" since it is the longest string

Solution

you can use collections.Counter to count repeatitions of substrings, and you just have to create substrings yourself.

a one liner would be to create an consume substrings by using itertools.chain.from_iterable, which would flatten the generators of substrings into one long generator for the counter to consume as follows.

from collections import Counter
from itertools import chain
input_list = [ "hello", "help", "helium", "habit", "habitat", "hazard"]

counter = Counter(
    chain.from_iterable(
        (entry[:i] for i in range(2,len(entry))) for entry in input_list
    )
)

counter_filtered = {x:y for x,y in counter.items() if y != 1}  # remove single entries

print(counter_filtered)

{'he': 3, 'hel': 3, 'ha': 3, 'hab': 2, 'habi': 2}