Search code examples
python-3.xfuzzywuzzyfuzzy-comparisonpython-re

Sum all counts when their fuzz.WRatio > 90 otherwise leave intact


What I want to do was actually group by all similar strings in one columns and sum their corresponding counts if there are similarity, otherwise, leave them. A little similar to this post. Unfortunately I have not been able to apply this to my case: How to group Pandas data frame by column with regex match

Unfortunately, I ended up with the following steps:

I wrote a function to print out all the fuzz.Wratio for each row of string, when each row does a linear search from the top to check if there are other similar strings in the rest of the rows. If the WRatio > 90, I would like to sum these row's corresponding counts. Otherwise, leave them there.

I created a test data looking like this:

test_data=pd.DataFrame({
    'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
    'count':[4,3,2,6]
})

So what I want to do is make the result as a dataframe like:

result=pd.Dataframe({
    'Nname':['Apple.Inc.','OMEGA'],
    'Ncount':[9,6]
})

My function so far only gave me the fuzz ratio for each row, and to my understanding is that, each row compares to itself three times( here we have four rows). So My function output would look like:

pd.Dataframe({
    'Nname':['Apple.Inc.','Apple.Inc.','Apple.Inc.','apple.inc',\
    'apple.inc','apple.inc'],
    'Ncount':[4,4,4,3,3,3],
    'FRatio': [100,100,100,100,100,100] })

This is just one portion of the whole output from the function I wrote with this test data. And the last row "OMEGA" would give me a fuzz ratio about 18.

My function is like this:

def checkDupTitle2(data):
    Nname=[]
    Ncount=[]
    f_ratio=[]

    for i in range(0, len(data)):
        current=0
        count=0
        space=0
        for space in range(0, len(data)-1-current):
            ratio=fuzz.WRatio(str(data.loc[i]['name']).strip(), \
                        str(data.loc[current+space]['name']).strip())
            Nname.append(str(data.loc[i]['name']).strip())
            Ncount.append(str(data.loc[i]['count']).strip())
            f_ratio.append(ratio)
            df=pd.DataFrame({
                'Nname': Nname,
                'Ncount': Ncount,
                'FRatio': f_ratio
            })
    return df

So after running this function and get the output, I tried to get what I eventually want. here I tried group by on the df created above:

output.groupby(output.FRatio>90).sum()

But this way, I still need a "name" in my dataframe, how can I decide on which names for this total counts, say, 9 here. "Apple.Inc" or "apple.inc" or "APPLE.INC"?

Or, did I make it too complex? Is there a way to group by "name" at the very first and treat "Apple.Inc.", "apple.inc" and "APPLE.INC" all the same, then my problem has solved. I have stump quite a while. Any helps would be highly appreciated! Thanks!


Solution

  • The following code is using my library RapidFuzz instead of FuzzyWuzzy since it is faster and it has a process method extractIndices which does help here. This solution is quite a bit faster, but since I do not work with pandas regulary I am sure there are still some things that could be improved :)

    import pandas as pd
    from rapidfuzz import process, utils
    
    def checkDupTitle(data):
        values = data.values.tolist()
        companies = [company for company, _ in values]
        pcompanies = [utils.default_process(company) for company in companies]
        counts = [count for _, count in values]
    
        results = []
        while companies:
            company = companies.pop(0)
            pcompany = pcompanies.pop(0)
            count = counts.pop(0)
    
            duplicates = process.extractIndices(
                pcompany, pcompanies,
                processor=None, score_cutoff=90, limit=None)
    
            for (i, _) in sorted(duplicates, reverse=True):
                count += counts.pop(i)
                del pcompanies[i]
                del companies[i]
            results.append([company, count])
    
        return pd.DataFrame(results, columns=['Nname','Ncount'])
    
    test_data=pd.DataFrame({
        'name':['Apple.Inc.','apple.inc','APPLE.INC','OMEGA'],
        'count':[4,3,2,6]
    })
    
    checkDupTitle(test_data)
    

    The result is

    pd.Dataframe({
        'Nname':['Apple.Inc.','OMEGA'],
        'Ncount':[9,6]
    })