Search code examples
pythonarrayspython-3.xalgorithmstring-matching

Counting strings in lists and then filtering & matching, in python


I have a list of words, and with python3 I count the difference in letters between each combination of words (using a clever diff_summing algorithm from this site):

import itertools

def diff_letters(a,b):
    return sum ( a[i] != b[i] for i in range(len(a)) )

w = ['AAHS','AALS','DAHS','XYZA']

for x,y in itertools.combinations(w,2):
    if diff_letters(x,y) == 1:
        print(x,y)

This prints:

AAHS AALS
AAHS DAHS

My question: How can I count and record that strings 'DAHS' and 'AALS' have exactly one partner, and 'AAHS' has two partners? I'll be filtering for directional combinations where each target_string has exactly one near_matching_word, so my final data would (as a JSON) look like this:

[
 {
   "target_word": "DAHS",
   "near_matching_word": "AAHS"
 },
 {
   "target_word": "AALS",
   "near_matching_word": "AAHS"
 }
]

(noticing that AAHS doesn't appear as a target_word)

I have one version using functools.reduce

import itertools
import functools
import operator

def diff_letters(a,b):
    return sum ( a[i] != b[i] for i in range(len(a)) )

w = ['AAHS','AALS','DAHS','XYZA']

pairs = []
for x,y in itertools.combinations(w,2):
    if diff_letters(x,y) == 1:
        #print(x,y)
        pairs.append((x,y))

full_list = functools.reduce(operator.add, pairs)
for x in full_list:
    if full_list.count(x) == 1:
        print (x)

which prints

AALS
DAHS

but then I would have to go back to my big list pairs to find the near_matching_word. Of course, in my final version, list pairs will be much larger, and the target_word could be either the 1st or 2nd item in the tuple (x,y).


Solution

  • You could use a dictionary instead of a list of pairs:

    pairs = {}
    for x, y in itertools.combinations(w, 2):
        if diff_letters(x, y) == 1:
            pairs.setdefault(x, []).append(y)
            pairs.setdefault(y, []).append(x)
    
    result = [{ "target_word": key, "near_matching_word": head, } for key, (head, *tail) in pairs.items() if not tail]
    
    print(result)
    

    Output

    [{'target_word': 'AALS', 'near_matching_word': 'AAHS'}, {'target_word': 'DAHS', 'near_matching_word': 'AAHS'}]
    

    In the pairs dictionary the keys are the target_words and the values are the near_matching_words. Then use a list comprehension to filter out those that have more that 1 near_matching_word.