Tags: python, list, indexing, levenshtein-distance

Filter duplicates from list based on Levenshtein distance


Let's say that I have a list of JSON objects like in the example. Among those that have duplicate title attributes (as determined by a Levenshtein-distance score over a certain threshold), I'd like to filter out the duplicates that do not have the minimum value of another attribute (sourceRank).

Here was my idea for how to do this; however, the indexing is broken. What is the most efficient way to accomplish this?

articles = [
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
    {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
    {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
    {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}},
]

print(len(articles))
print([a['_source']['title'] for a in articles])

def levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

indices = []
for i1, a1 in enumerate(articles):
    for i2, a2 in enumerate(articles):
        if levenshtein_distance(a1['_source']['title'], a2['_source']['title']) > .8:
            if a1['_source']['sourceRank'] > a2['_source']['sourceRank']:
                indices += [i1]
            else:
                indices += [i2]
articles = [i for j, i in enumerate(articles) if j not in indices]

print(len(articles))
print([a['_source']['title'] for a in articles])
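One thing worth noting about the code above: levenshtein_distance returns an absolute edit count (an integer), so a fractional threshold such as > .8 is true for almost any pair of non-identical titles and false for identical ones. For a threshold like 0.8 to be meaningful, the distance needs to be normalised into a similarity score first. A minimal sketch (dividing by the longer string's length is one common convention, not the only one; the distance function is repeated so the snippet runs standalone):

```python
def levenshtein_distance(s1, s2):
    # same dynamic-programming implementation as in the question
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = list(range(len(s1) + 1))
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min(distances[i1], distances[i1 + 1], distances_[-1]))
        distances = distances_
    return distances[-1]

def similarity(s1, s2):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))

print(levenshtein_distance("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))            # 1 - 3/7 ~= 0.571
```

With a similarity defined this way, "over a certain threshold" reads naturally: two titles are duplicates when `similarity(t1, t2) >= 0.8`.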

Solution

  • The gist of your question seems to be removing duplicate titles from your list while ensuring the remaining article has the lowest sourceRank. I don't know how high the sourceRank values could potentially be, so I just took a stab at 100000 for a sentinel value.

    #!/usr/bin/env python3
    
    import itertools
    
    
    articles = [
        {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':4.0}},
        {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':1.0}},
        {'_source': {'title':'Cyber Monday UK Apple deals 2018: MacBooks, iPhones, iPads and Apple Watches', 'sourceRank':2.0}},
        {'_source': {'title':'Apple Pay Apple Pay Launches in Belgium and Kazakhstan', 'sourceRank':1.0}},
        {'_source': {'title':'APPLE : Supreme Court weighs antitrust dispute over Apple App Store', 'sourceRank':3.0}}
    ]
    
    def reducer(iter_):
        """Return the article in iter_ with the lowest sourceRank."""
        lowest_rank = 100000  # sentinel: assumed higher than any real rank
        retval = None
        for value in iter_:
            current_rank = value["_source"]["sourceRank"]
            if current_rank < lowest_rank:
                lowest_rank = current_rank
                retval = value
        return retval
    
    
    # groupby only merges *consecutive* items with the same key,
    # so sort by title first in case duplicates are not adjacent
    for title, group in itertools.groupby(
            sorted(articles, key=lambda a: a["_source"].get("title", "")),
            key=lambda a: a["_source"].get("title")):
        print(reducer(group))
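The groupby approach above only merges exact title matches. If near-duplicate titles should also be collapsed, as the Levenshtein threshold in the question suggests, one option is a greedy clustering pass. This is a sketch, not the only strategy: it uses difflib.SequenceMatcher from the standard library as the normalised similarity (any normalised Levenshtein score would do), with the question's 0.8 threshold:

```python
import difflib

def dedupe(articles, threshold=0.8):
    """Keep, per group of near-duplicate titles, the article with the
    lowest sourceRank. Greedy: each article is compared against the
    representatives kept so far; the first match over the threshold
    decides its cluster."""
    kept = []  # one representative per cluster, in input order
    for art in articles:
        title = art["_source"]["title"]
        for i, rep in enumerate(kept):
            ratio = difflib.SequenceMatcher(
                None, title, rep["_source"]["title"]).ratio()
            if ratio >= threshold:  # near-duplicate of an existing cluster
                if art["_source"]["sourceRank"] < rep["_source"]["sourceRank"]:
                    kept[i] = art  # lower sourceRank wins
                break
        else:
            kept.append(art)  # no cluster matched: start a new one
    return kept
```

On the sample data this keeps three articles, with the Cyber Monday cluster represented by its sourceRank 1.0 entry. As an aside, the reducer above is essentially `min(group, key=lambda a: a["_source"]["sourceRank"])`, except that it returns None for an empty group instead of raising.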