Search code examples
pythonsubstring

Frequency distribution in a list of strings based on common beginning substring


Is there a way to get the common substrings that form the beginning of a list of strings? The length of Substring must be greater than 1.

For instance, if there is a list L:

[ "hello", "help", "helium", "habit", "habitat", "hazard"]

The output should be a frequency distribution should be:

he - 3
hel - 3
habit - 2
ha - 3

I can later remove the duplicates formed by the same strings by again applying the same logic to look at the common beginning substrings from the distribution, checking their counts, and if the counts are same, consider only the longest string.

i.e., We get 'he' and 'hel' formed by the same set of words, having the same counts. Hence, picking only "hel" since it is the longest string


Solution

  • you can use collections.Counter to count repeatitions of substrings, and you just have to create substrings yourself.

    a one liner would be to create an consume substrings by using itertools.chain.from_iterable, which would flatten the generators of substrings into one long generator for the counter to consume as follows.

    from collections import Counter
    from itertools import chain
    input_list = [ "hello", "help", "helium", "habit", "habitat", "hazard"]
    
    counter = Counter(
        chain.from_iterable(
            (entry[:i] for i in range(2,len(entry))) for entry in input_list
        )
    )
    
    counter_filtered = {x:y for x,y in counter.items() if y != 1}  # remove single entries
    
    print(counter_filtered)
    
    {'he': 3, 'hel': 3, 'ha': 3, 'hab': 2, 'habi': 2}