Tags: python, pandas, nlp, similarity

How to club similar words under one key in a Python dictionary


I am learning text processing and am stuck. I have a dataset from a survey about which websites users spend their money on while shopping.

I have data of the form: amazon, amzn, amazon prime, amazon.com, amzn prim, etc.

Now I want to create a dictionary which clubs the similar values under one key, like:

dict1 = {"AMAZON":["amazon","amzn","amazon prime","amazon.com", "amzn prim"], 
         "Coursera" : ["coursera","corsera","coursera.org","coursera.com"]} 

The main goal of the dictionary is to create another column in the dataframe containing the key for each website name.

I have tried fuzzywuzzy, but I am unable to understand how to club the similar values under one key.

Thanks :)


Solution

  • Your task is to associate a response str with the correct key str from a list of pre-defined keys. Therefore, you need to compare a given str response (e.g. "amazon.com") with each of your keys ["AMAZON", "Coursera"] and pick the key that displays the highest similarity with respect to some metric.
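    For instance, with fuzzywuzzy (which the question already mentions), picking the best key for a single response could look like the following sketch; the cut-off of 80 is an assumption, not a recommendation:

    from fuzzywuzzy import process

    keys = ["AMAZON", "Coursera"]

    # extractOne returns the choice with the highest fuzzy score, e.g. ('AMAZON', <score>)
    best_key, score = process.extractOne("amazon.com", keys)
    if score >= 80:  # assumed cut-off below which a response stays unmatched
        print(best_key)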

    1. Manual choice of Keys

    Choosing a suitable metric on strings is the tricky part, as such metrics merely treat strings as arrays of characters: no consideration is given to the semantics of the words and no domain knowledge is involved. In turn, I'd suggest manual matching if the number of keys is low. Python's built-in string class str provides lower() to make the comparison case-insensitive, and the in operator checks for membership of a substring. This is a good starting point.

    def getKey(website: str):
        '''Return the first hand-crafted key whose pattern occurs in the website name.'''
        # case-insensitive
        website = website.lower()

        # 1. handcrafted key-pattern matching
        refDict = dict()
        refDict['AMAZON']   = ["amzn", "amazon"]
        refDict['COURSERA'] = ["coursera", "corsera"]

        for k, v in refDict.items():
            if any(pattern in website for pattern in v):
                return k

        # if no match found
        return ""
    

    Applied to a Pandas frame, this yields

    import pandas as pd

    df = pd.DataFrame({'website': ['amazon', 'amzn.com', 'coursera', 'corsera', 'cosera', 'save-the-amazon-forest.org']})
    df['key'] = [getKey(website) for website in df['website']]
    df
    

                          website       key
    0                      amazon    AMAZON
    1                    amzn.com    AMAZON
    2                    coursera  COURSERA
    3                     corsera  COURSERA
    4                      cosera
    5  save-the-amazon-forest.org    AMAZON

    As you can see, this string comparison is inherently brittle, too: cosera goes unmatched, while save-the-amazon-forest.org is wrongly assigned to AMAZON. In addition, the order of the keys in the dictionary matters, since the first matching key wins. Note that only since Python 3.6 do dictionaries maintain insertion order by default. If you use an earlier version, use an OrderedDict to keep control of the order, as sketched below.
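    A minimal sketch with OrderedDict (the key order here is illustrative):

    from collections import OrderedDict

    # on Python < 3.7, plain dicts do not guarantee insertion order,
    # so OrderedDict makes the match priority explicit
    refDict = OrderedDict()
    refDict['AMAZON']   = ["amzn", "amazon"]       # checked first
    refDict['COURSERA'] = ["coursera", "corsera"]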

    If you can enforce that users write the proper URL, you might want to consider extracting it from the string via a regular expression and using it directly as the key. This would save you from listing keys and matching patterns manually in getKey() altogether. A sketch of this idea follows.
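    A minimal sketch, assuming every response contains a recognizable domain (the pattern and the getDomainKey name below are illustrative, not exhaustive):

    import re

    def getDomainKey(website: str):
        '''Extract the second-level domain and use it directly as the key.'''
        match = re.search(r'([a-z0-9-]+)\.(?:com|org|net|co\.uk)', website.lower())
        return match.group(1).upper() if match else ""

    print(getDomainKey("amazon.com"))    # AMAZON
    print(getDomainKey("coursera.org"))  # COURSERA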

    2. Automatic keys via unsupervised learning

    Since the additional requirement was raised that the algorithm needs to find the keys in an unsupervised fashion, the following code uses the edit (Levenshtein) distance and clustering to do exactly that.
    
    import pandas as pd
    import numpy as np
    from sklearn.cluster import AffinityPropagation
    from nltk.metrics import edit_distance
    
    
    # example input
    websiteList = ["amazon", "apple.com", "amzn", "amazon prime" , "amazon.com", "cosera", "apple inc.", "amzn prim", "coursera", 
                   "coursera", "coursera.org", "coursera.com", "StackOverFlow.com", "stackoverflow", "stack-overflow.com", 
                   "corsing", "apple", "AAPL"]
    websiteListRaw = list(websiteList) # copy for later
    
    df = pd.DataFrame({'website' : websiteList})
    
    
    def minEditDistance(s1, s2):
        '''Minimum edit distance across all pairwise input (sub-)strings'''
        ptrList_1 = s1.split(' ') + [s1]
        ptrList_2 = s2.split(' ') + [s2]
        
        return min([edit_distance(x_i, x_j) for x_i in ptrList_1 for x_j in ptrList_2])
    
    # lowercase
    websiteList = [site.lower() for site in websiteList]
    N = len(websiteList)
    
    # delete common URL suffixes (str.removesuffix requires Python 3.9+)
    suffixList = ['.com', '.org', 'co.uk', '.eu']
    for i in range(N):
        for suffix in suffixList:
            websiteList[i] = websiteList[i].removesuffix(suffix)
    
    # replace special characters
    specialSymbolList = ['/', '-', '*']
    for i in range(N):
        for symbol in specialSymbolList:
            websiteList[i] = websiteList[i].replace(symbol, ' ')
        
    # similarity = -1 * distance
    responses = np.array(websiteList) 
    minEditSimilarity = (-1.0)*np.array([[minEditDistance(w1,w2) for w1 in responses] for w2 in responses])
    
    # clustering
    affprop = AffinityPropagation(affinity="precomputed", damping=0.54, random_state=77)
    affprop.fit(minEditSimilarity)
    
    # return
    matchDict = dict()
    for cluster_id in np.unique(affprop.labels_):
        exemplar = responses[affprop.cluster_centers_indices_[cluster_id]]
        cluster = np.unique(responses[np.nonzero(affprop.labels_==cluster_id)])
        cluster_str = ", ".join(cluster)
        print(" - *%s:* %s" % (exemplar, cluster_str))
        # assign
        for resp in cluster:
            match_indices = [i for i, name in enumerate(websiteList) if name==resp]
            for resp_index in match_indices:
                matchDict[websiteListRaw[resp_index]] = exemplar.split(' ')[0].upper()
            
    # add learned keys
    df['key'] = df['website'].replace(matchDict)
    df
    

    [output: the frame from above with an added key column holding the learned cluster exemplars]
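    If you still want the key-to-responses dictionary from the question, you can invert matchDict (this inversion step is my addition, not part of the clustering code above):

    # invert matchDict (response -> key) into dict1 (key -> list of responses)
    dict1 = {}
    for response, key in matchDict.items():
        dict1.setdefault(key, []).append(response)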