I have a list of keys ['foo_a','foo_b','foo_c','fnord']
All of the similar solutions here assume that you have no fnord
's in your text.
I have this code that does the job:
def detect_prefix(keys):
PCT = 0.70 # cutof
pre = ''
l = len(keys)
for i in range(0, len(max(keys, key=len))):
keys = filter(lambda k: k.startswith(pre), keys)
cnt = dict()
for k in map(lambda k: k[i], keys):
cnt.setdefault(k,0)
cnt[k] +=1
if cnt[max(cnt)] / float(l) >= PCT:
pre += max(cnt)
else:
break
return pre
I have a strong suspicion that this could be done much more elegantly, but my python-fu is not strong enough right now.
I would love to hear some suggestions.
Edit.
Additional background and clarification.
These are keys that other developers have put in an application to use for translation.
They should have a common prefix, but people forget, and they cut and paste from other code. "_" as a prefix separator is just a convention. It would be best not to assume a separator is even used. 70% is a totally arbitrary threshold. "most prevalent" or "predominant" would have worked too.
And yes, this is python 2.7, and the space inside the quotes is just a visual artefact.
If you know the desired threshold frequency for the common prefix:
#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest
strings = ['foo_a','foo_b','foo_c','fnord']
threshold = .7 * len(strings)
prefix = []
for chars in izip_longest(*strings, fillvalue=''):
char, count = Counter(chars).most_common(1)[0]
if count < threshold:
break
prefix.append(char)
print(''.join(prefix))
# -> foo_
Or you could collect all common prefixes and their frequencies and decide later:
#!/usr/bin/env python
from collections import Counter
from itertools import izip_longest
strings = ['foo_a', 'foo_b','foo_c','fnord']
assert len(strings) > 1
threshold = len(strings)
prefix = []
prefixes = []
for chars in izip_longest(*strings, fillvalue=''):
char, count = Counter(chars).most_common(1)[0]
if count == 1:
break
elif count < threshold:
if prefix:
prefixes.append((''.join(prefix), threshold))
threshold = count
prefix.append(char)
if prefix:
prefixes.append((''.join(prefix), threshold))
print(prefixes)
# -> [('f', 4), ('foo_', 3)]
Both code examples assume that the predominant prefix exists i.e., the most common character at each position belongs to the most common prefix.