input list:
['A.B.C.D','A.A.C.D','A.B.E.F','A.B.E.GG']
The strings are variables concatenated with a dot. So in first string, variables are A B C D, but in last string variable are A B E GG (variables can be of varying length), but the strings will always have the same number of variables separated by the dot. I would like to group those strings together which have only one different variable. So above would produce 2 groups.
['A.B.C.D','A.A.C.D'] ['A.B.E.F','A.B.E.GG']
The the difference has to be in the same variable. not one difference across all variables. and each string should be included only once, not multiple times across groups.
Real data could have up to 20 variables (but all items in each list will always have the same number of variables), and lists could have several minion strings, so performance is also a concern.
I wrote an algorithm which does it but is too complicated. I also tried via itertools groupby but could not get it to produce the correct results:
import itertools
import difflib
class Grouper:
def __init__(self, diff):
self.last = None
self.diff = diff
def get_diffs(self, curr_key):
if self.last == None:
return []
dims = curr_key.split('.')
previous_dims = self.last.split('.')
diffs = list((dm for dm in difflib.ndiff(dims, previous_dims) if '-' not in dm and '?' not in dm))
return [n for n, d in enumerate(diffs) if '+' in d]
def __call__(self, item):
result = self.get_diffs(item)
print(item)
self.last = item
return result
grp = Grouper(data)
groups = []
uniquekeys = []
for k, g in itertools.groupby(data, grp):
groups.append(list(g)) # Store group iterator as a list
uniquekeys.append(k)
UPDATE:
More sample input:
['D.1.2.A.1.B.C', 'D.7.2.A.1.B.C', 'D.21.2.A.1.B.C', 'D.15.6.A.1.B.C', 'D.25.6.A.1.B.C', 'D.7.2.A.1.B.C', 'D.8.2.A.1.B.C', 'D.8.6.A.1.B.C', 'D.10.2.A.1.B.C', 'D.14.2.A.1.B.C', 'D.15.2.A.1.B.C', 'D.15.6.A.1.B.C', 'D.16.2.A.1.B.C', 'D.17.2.A.1.B.C', 'D.18.2.A.1.B.C', 'D.19.2.A.1.B.C', 'D.20.2.A.1.B.C', 'D.21.2.A.1.B.C', 'D.22.2.A.1.B.C', 'D.23.2.A.1.B.C', 'D.25.2.A.1.B.C', 'D.25.6.A.1.B.C', 'D.26.2.A.1.B.C', 'D.27.2.A.1.B.C']
I added another string "A.B.C.F", this string can be matched with "A.B.C.D" and "A.B.E.F":
import itertools
def main(l: list) -> list:
splits = [tuple(s.split(".")) for s in l]
groups = {}
# Dict of {(tuple, index of difference): list of tuples that match the key}
for split in splits:
matched = False
for (t, i) in groups.keys():
# We match two splits if they only have one difference
diffs = [i for i in range(len(split)) if split[i] != t[i]]
if len(diffs) == 1:
# The strings match but is the first match of 't'?
if i is None:
groups.pop((t, i))
groups[(t, diffs[0])] = [split]
# We found a match, stop the matching of this tuple
matched = True
break
elif diffs[0] == i:
groups[(t, i)].append(split)
matched = True
break
if not matched:
# Let's add this split as a candidate for future splits
groups[(split, None)] = []
return [[".".join(k)] + [".".join(s) for s in v] for (k, i), v in groups.items()]
print(main(["A.B.C.D", "A.A.C.D", "A.B.E.F", "A.B.E.GG", "A.B.C.F"]))
# [['A.B.C.D', 'A.A.C.D'], ['A.B.E.F', 'A.B.E.GG'], ['A.B.C.F']]
print(main(["A.B.C", "A.B.D", "A.E.D"]))
# [['A.B.C', 'A.B.D'], ['A.E.D']]