Tags: python, list, indexing, levenshtein-distance

Filter duplicates from list based on Levenshtein distance


Let's say that I have a list of JSON objects like in the example. Among those that have duplicate title attributes (as determined by a Levenshtein-distance score over a certain threshold), I'd like to filter out the duplicates that do not have the minimum value of another attribute (sourceRank).

Here was my idea for how to do this; however, the indexing is broken. What is the most efficient way to accomplish this?

articles = [
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
    {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
    {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}},
]

print(len(articles))
print([a['_source']['title'] for a in articles])

def levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

indices = []
for i1, a1 in enumerate(articles):
    for i2, a2 in enumerate(articles):
        if levenshtein_distance(a1['_source']['title'], a2['_source']['title']) > .8:
            if a1['_source']['sourceRank'] > a2['_source']['sourceRank']:
                indices += [i1]
            else:
                indices += [i2]
articles = [i for j, i in enumerate(articles) if j not in indices]

print(len(articles))
print([a['_source']['title'] for a in articles])
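One thing worth noting about the code above: levenshtein_distance returns an absolute edit count (an integer), so a fractional threshold such as > .8 is true for almost any pair of non-identical titles and false for identical ones. For a threshold like 0.8 to be meaningful, the distance needs to be normalised into a similarity score first. A minimal sketch (dividing by the longer string's length is one common convention, not the only one; the distance function is repeated so the snippet runs standalone):

```python
def levenshtein_distance(s1, s2):
    # same dynamic-programming implementation as in the question
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min(distances[i1], distances[i1 + 1], distances_[-1]))
        distances = distances_
    return distances[-1]

def similarity(s1, s2):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))

print(levenshtein_distance("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))            # 1 - 3/7 ~= 0.571
```

With a similarity defined this way, "over a certain threshold" reads naturally: two titles are duplicates when `similarity(t1, t2) >= 0.8`.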

Solution

  • The gist of your question seems to be removing duplicate titles from your list while ensuring the remaining article has the lowest sourceRank. I don't know how high the sourceRank values could potentially be, so I just took a stab at 100000 for a sentinel value.

    #!/usr/bin/env python3
    
    import itertools
    
    
    articles = [
        {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
        {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
        {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
        {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
        {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}}
    ]
    
    def reducer(iter_):
        """Return the article in iter_ with the lowest sourceRank."""
        lowest_rank = 100000  # sentinel: assumed higher than any real rank
        retval = None
        for value in iter_:
            current_rank = value["_source"]["sourceRank"]
            if current_rank < lowest_rank:
                lowest_rank = current_rank
                retval = value
        return retval
    
    
    # groupby only merges *consecutive* items with the same key,
    # so sort by title first in case duplicates are not adjacent
    for title, group in itertools.groupby(
            sorted(articles, key=lambda a: a["_source"].get("title", "")),
            key=lambda a: a["_source"].get("title")):
        print(reducer(group))
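The groupby approach above only merges exact title matches. If near-duplicate titles should also be collapsed, as the Levenshtein threshold in the question suggests, one option is a greedy clustering pass. This is a sketch, not the only strategy: it uses difflib.SequenceMatcher from the standard library as the normalised similarity (any normalised Levenshtein score would do), with the question's 0.8 threshold:

```python
import difflib

def dedupe(articles, threshold=0.8):
    """Keep, per group of near-duplicate titles, the article with the
    lowest sourceRank. Greedy: each article is compared against the
    representatives kept so far; the first match over the threshold
    decides its cluster."""
    kept = []  # one representative per cluster, in input order
    for art in articles:
        title = art["_source"]["title"]
        for i, rep in enumerate(kept):
            ratio = difflib.SequenceMatcher(
                None, title, rep["_source"]["title"]).ratio()
            if ratio >= threshold:  # near-duplicate of an existing cluster
                if art["_source"]["sourceRank"] < rep["_source"]["sourceRank"]:
                    kept[i] = art  # lower sourceRank wins
                break
        else:
            kept.append(art)  # no cluster matched: start a new one
    return kept
```

On the sample data this keeps three articles, with the Cyber Monday cluster represented by its sourceRank 1.0 entry. As an aside, the reducer above is essentially `min(group, key=lambda a: a["_source"]["sourceRank"])`, except that it returns None for an empty group instead of raising.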