Search code examples
pythonpandasdifflibsequencematcher

Print Rows that are "Near Duplicates" in Pandas DataFrame


I'm working on a personal project that performs Web Scraping on multiple databases of research articles (thus far I have done PubMed and Scopus) and extracts the titles of the articles. I've actually managed to pull this off on my own without problem. Next, I've combined the list of articles into a Pandas DataFrame with two columns: Article and Database. I wanted to remove duplicates across the two article databases and used df = df.drop_duplicates(subset='Article') to remove exact matches.

BUT, what if the articles are "near matches", that is, perhaps a word in the title was misspelled, or there is an extra blank space somewhere in the title (not at the end, I've proofed using lstrip() and rstrip()).

I have explored string matching in the past using SequenceMatcher from difflib, but never in a DataFrame. So, my question is, how would I code the following conditional so that I can review the near similar values:

"if row in df['Article'] is 95% similar to another row in df['Article'], print both rows."

I started doing some testing using separate columns like such:

letters1 = ['a','b','c','a','b']
letters2 = ['c','b','a','a','c']
numbers = [1,2,3,4,5]

data = {'Letters1':letters,
        'Letters2':letters2,
        'Numbers':numbers}

test = pd.DataFrame(data)
test['result'] = ''

for i in test['Letters1'].index:
    if SequenceMatcher(None, test['Letters1'], test['Letters2']).ratio() > 0:
        test['result'] = 'True'
    else:
        test['result'] = 'False'

test.head()

However, I'm already not getting the desired results and thought to seek help here first. Any suggestions? To reiterate, I don't want to use two columns ultimately, I am just using the example code block above to start testing how to do this.


Solution

  • The unexpected result in your code is due to using whole columns instead of items. You can fix that for example by using the .at accessor

    for i in test.index:
        if SequenceMatcher(None, test.at[i, 'Letters1'], test.at[i, 'Letters2']).ratio() > 0:
            test.at[i, 'result'] = True
        else:
            test.at[i, 'result'] = False
    

    or more compact by

    test["result"] = test.apply(
        lambda r: SequenceMatcher(None, r.at['Letters1'], r.at['Letters2']).ratio() > 0,
        axis=1
    )
    

    Result for the sample:

      Letters1 Letters2  Numbers result
    0        a        c        1  False
    1        b        b        2   True
    2        c        a        3  False
    3        a        a        4   True
    4        b        c        5  False
    

    As an alternative you could do something like:

    from itertools import combinations
    
    # Sample dataframe
    df = pd.DataFrame({'Letters': ['a', 'b', 'c', 'a', 'b']})
    
    for i, j in combinations(df.index, r=2):
        txt1, txt2 = df.at[i, "Letters"], df.at[j, "Letters"]
        if SequenceMatcher(None, txt1, txt2).ratio() > 0:
            print((i, txt1), (j, txt2))
    

    Output:

    (0, 'a') (3, 'a')
    (1, 'b') (4, 'b')