Search code examples
pythonalgorithmpython-2.7prefixrobust

Find *most* common prefix of strings - a better way?


I have a list of keys ['foo_a','foo_b','foo_c','fnord']

All of the similar solutions here assume that you have no fnord's in your text.

I have this code that does the job:

def detect_prefix(keys):
    PCT = 0.70 # cutof 
    pre = ''
    l = len(keys)
    for i in range(0, len(max(keys, key=len))):
        keys = filter(lambda k: k.startswith(pre), keys)
        cnt = dict()
        for k in map(lambda k: k[i], keys):
            cnt.setdefault(k,0)
            cnt[k] +=1
        if cnt[max(cnt)] / float(l) >= PCT:
            pre += max(cnt)
        else:
            break
    return pre

I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.

I would love to hear some suggestions.

Edit. Additional background and clarification.
These are keys that other developers have put in an application to use for translation. They should have a common prefix, but people forget, and they cut and paste from other code. "_" as a prefix separator is just a convention. It would be best not to assume a separator is even used. 70% is a totally arbitrary threshold. "most prevalent" or "predominant" would have worked too.
And yes, this is python 2.7, and the space inside the quotes is just a visual artefact.


Solution

  • If you know the desired threshold frequency for the common prefix:

    #!/usr/bin/env python
    from collections import Counter
    from itertools import izip_longest
    
    strings = ['foo_a','foo_b','foo_c','fnord']
    threshold = .7 * len(strings)
    prefix = []
    for chars in izip_longest(*strings, fillvalue=''):
        char, count = Counter(chars).most_common(1)[0]
        if count < threshold:
            break
        prefix.append(char)
    print(''.join(prefix))
    # -> foo_
    

    Or you could collect all common prefixes and their frequencies and decide later:

    #!/usr/bin/env python
    from collections import Counter
    from itertools import izip_longest
    
    strings = ['foo_a', 'foo_b','foo_c','fnord']
    assert len(strings) > 1
    threshold = len(strings)
    prefix = []
    prefixes = []
    for chars in izip_longest(*strings, fillvalue=''):
        char, count = Counter(chars).most_common(1)[0]
        if count == 1:
            break
        elif count < threshold:
            if prefix:
                prefixes.append((''.join(prefix), threshold))
            threshold = count
        prefix.append(char)
    if prefix:
        prefixes.append((''.join(prefix), threshold))
    print(prefixes)
    # -> [('f', 4), ('foo_', 3)]
    

    Both code examples assume that the predominant prefix exists i.e., the most common character at each position belongs to the most common prefix.