Tags: python, pandas, nlp, similarity

How to club similar words under one key in a Python dictionary


I am learning text processing and am stuck. I have a dataset from a survey about which websites users spend their money on while shopping.

I have data of the form: amazon, amzn, amazon prime, amazon.com, amzn prim, etc.

Now I want to create a dictionary which clubs the similar values under one key, like:

dict1 = {"AMAZON":["amazon","amzn","amazon prime","amazon.com", "amzn prim"], 
         "Coursera" : ["coursera","corsera","coursera.org","coursera.com"]} 

The main goal of the dictionary is to create another column in the dataframe containing the key for each website name.

I have tried fuzzywuzzy, but I am unable to understand how to club the similar values under one key.

Thanks :)


Solution

  • Your task is to associate a response str with the correct key str from a list of pre-defined keys. Therefore, you need to compare a given str response (e.g. "amazon.com") with each of your keys ["AMAZON", "Coursera"] and pick the key that displays the highest similarity with respect to some metric.
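    For instance, with fuzzywuzzy (which the question already mentions), picking the best key for a single response could look like the following sketch; the cut-off of 80 is an assumption, not a recommendation:

    from fuzzywuzzy import process

    keys = ["AMAZON", "Coursera"]

    # extractOne returns the choice with the highest fuzzy score, e.g. ('AMAZON', <score>)
    best_key, score = process.extractOne("amazon.com", keys)
    if score >= 80:  # assumed cut-off below which a response stays unmatched
        print(best_key)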

    1. Manual choice of Keys

    Choosing a suitable metric on strings is the tricky part, as such metrics merely treat strings as arrays of characters: no consideration is given to the semantics of the words and no domain knowledge is involved. In turn, I'd suggest manual matching if the number of keys is low. Python's built-in string class str provides lower() to make the comparison case-insensitive, and the in operator checks for membership of a substring. This is a good starting point.

    def getKey(website: str):
        '''Return the first hand-crafted key whose pattern occurs in the website name.'''
        # case-insensitive
        website = website.lower()

        # 1. handcrafted key-pattern matching
        refDict = dict()
        refDict['AMAZON']   = ["amzn", "amazon"]
        refDict['COURSERA'] = ["coursera", "corsera"]

        for k, v in refDict.items():
            if any(pattern in website for pattern in v):
                return k

        # if no match found
        return ""
    

    Applied to a Pandas frame, this yields

    import pandas as pd

    df = pd.DataFrame({'website': ['amazon', 'amzn.com', 'coursera', 'corsera', 'cosera', 'save-the-amazon-forest.org']})
    df['key'] = [getKey(website) for website in df['website']]
    df
    

                          website       key
    0                      amazon    AMAZON
    1                    amzn.com    AMAZON
    2                    coursera  COURSERA
    3                     corsera  COURSERA
    4                      cosera
    5  save-the-amazon-forest.org    AMAZON

    As you can see, this string comparison is inherently brittle, too: cosera goes unmatched, while save-the-amazon-forest.org is wrongly assigned to AMAZON. In addition, the order of the keys in the dictionary matters, since the first matching key wins. Note that only since Python 3.6 do dictionaries maintain insertion order by default. If you use an earlier version, use an OrderedDict to keep control of the order, as sketched below.
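    A minimal sketch with OrderedDict (the key order here is illustrative):

    from collections import OrderedDict

    # on Python < 3.7, plain dicts do not guarantee insertion order,
    # so OrderedDict makes the match priority explicit
    refDict = OrderedDict()
    refDict['AMAZON']   = ["amzn", "amazon"]       # checked first
    refDict['COURSERA'] = ["coursera", "corsera"]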

    If you can enforce that users write the proper URL, you might want to consider extracting it from the string via a regular expression and using it directly as the key. This would save you from listing keys and matching patterns manually in getKey() altogether. A sketch of this idea follows.
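    A minimal sketch, assuming every response contains a recognizable domain (the pattern and the getDomainKey name below are illustrative, not exhaustive):

    import re

    def getDomainKey(website: str):
        '''Extract the second-level domain and use it directly as the key.'''
        match = re.search(r'([a-z0-9-]+)\.(?:com|org|net|co\.uk)', website.lower())
        return match.group(1).upper() if match else ""

    print(getDomainKey("amazon.com"))    # AMAZON
    print(getDomainKey("coursera.org"))  # COURSERA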

    2. Automatic keys via unsupervised learning

    Since the additional requirement was raised that the algorithm needs to find the keys in an unsupervised fashion, the following code uses the edit (Levenshtein) distance and clustering to do exactly that.
    
    import pandas as pd
    import numpy as np
    from sklearn.cluster import AffinityPropagation
    from nltk.metrics import edit_distance
    
    
    # example input
    websiteList = ["amazon", "apple.com", "amzn", "amazon prime" , "amazon.com", "cosera", "apple inc.", "amzn prim", "coursera", 
                   "coursera", "coursera.org", "coursera.com", "StackOverFlow.com", "stackoverflow", "stack-overflow.com", 
                   "corsing", "apple", "AAPL"]
    websiteListRaw = list(websiteList) # copy for later
    
    df = pd.DataFrame({'website' : websiteList})
    
    
    def minEditDistance(s1, s2):
        '''Minimum edit distance across all pairwise input (sub-)strings'''
        ptrList_1 = s1.split(' ') + [s1]
        ptrList_2 = s2.split(' ') + [s2]
        
        return min([edit_distance(x_i, x_j) for x_i in ptrList_1 for x_j in ptrList_2])
    
    # lowercase
    websiteList = [site.lower() for site in websiteList]
    N = len(websiteList)
    
    # delete common URL suffixes (str.removesuffix requires Python 3.9+)
    suffixList = ['.com', '.org', 'co.uk', '.eu']
    for i in range(N):
        for suffix in suffixList:
            websiteList[i] = websiteList[i].removesuffix(suffix)
    
    # replace special characters
    specialSymbolList = ['/', '-', '*']
    for i in range(N):
        for symbol in specialSymbolList:
            websiteList[i] = websiteList[i].replace(symbol, ' ')
        
    # similarity = -1 * distance
    responses = np.array(websiteList) 
    minEditSimilarity = (-1.0)*np.array([[minEditDistance(w1,w2) for w1 in responses] for w2 in responses])
    
    # clustering
    affprop = AffinityPropagation(affinity="precomputed", damping=0.54, random_state=77)
    affprop.fit(minEditSimilarity)
    
    # return
    matchDict = dict()
    for cluster_id in np.unique(affprop.labels_):
        exemplar = responses[affprop.cluster_centers_indices_[cluster_id]]
        cluster = np.unique(responses[np.nonzero(affprop.labels_==cluster_id)])
        cluster_str = ", ".join(cluster)
        print(" - *%s:* %s" % (exemplar, cluster_str))
        # assign
        for resp in cluster:
            match_indices = [i for i, name in enumerate(websiteList) if name==resp]
            for resp_index in match_indices:
                matchDict[websiteListRaw[resp_index]] = exemplar.split(' ')[0].upper()
            
    # add learned keys
    df['key'] = df['website'].replace(matchDict)
    df
    

    [output: the frame from above with an added key column holding the learned cluster exemplars]
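    If you still want the key-to-responses dictionary from the question, you can invert matchDict (this inversion step is my addition, not part of the clustering code above):

    # invert matchDict (response -> key) into dict1 (key -> list of responses)
    dict1 = {}
    for response, key in matchDict.items():
        dict1.setdefault(key, []).append(response)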