Tags: python, pandas, difflib

Speeding up a comparison function for comparing sentences


I have a data frame with a shape of (789174, 9). There is a column called resolution that contains a sentence less than 139 characters in length. I built a function to find sentences that have a similarity score above 0.9, using the difflib library. I have a virtual machine with 96 CPUs and 384 GB of RAM. I have been running this function for more than 2 hours now and it still has not reached i = 1000. With roughly 789,174 rows, a full pairwise pass is on the order of 3 × 10^11 ratio() calls, so I am concerned that this will take far too long to process, and I am wondering if there is a way to speed it up.

import difflib
import time

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar with an earlier occurrence
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping

Clearly, since we are iterating through the column twice, this is O(n^2). I am not sure if there is a way to make this faster. Any suggestions would be greatly appreciated.
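
For reference, difflib's SequenceMatcher also exposes quick_ratio() and real_quick_ratio(), which are documented upper bounds on ratio() and are much cheaper to compute. Below is a sketch of using them to skip hopeless pairs inside the same double loop; it is only an illustration of the idea, not something I have benchmarked:

def replace_similars_pruned(input_list):
    # Same O(n^2) structure, but the cheap upper bounds reject most pairs
    # before the expensive ratio() call has to run.
    for i in range(len(input_list)):
        sm = difflib.SequenceMatcher(None)
        sm.set_seq2(input_list[i])  # SequenceMatcher caches info about seq2
        for j in range(i + 1, len(input_list)):
            sm.set_seq1(input_list[j])
            # real_quick_ratio() and quick_ratio() never return less than ratio()
            if sm.real_quick_ratio() >= 0.9 and sm.quick_ratio() >= 0.9 and sm.ratio() >= 0.9:
                input_list[j] = input_list[i]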

EDIT:

I have attempted a speed-up using difflib and fuzzywuzzy. The function only goes through the column once, but I do iterate through the dictionary keys.

import difflib
from fuzzywuzzy import fuzz

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:           
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters
mappings = cluster_resolution(df_sample)

Is it possible to speed up the latter function?

Here is an example of some data in a dataframe:

d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}

import pandas as pd

df = pd.DataFrame(data=d)

How I define similarity:

Similarity is really defined by the overall action taken, such as replaced scanner versus replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use. The longer string's overall action was replacing the scanner, so those two are very similar, which is why I chose to use the partial_ratio function, since those two have a score of 100.

Attention:

Please refer to the second function, cluster_resolution, as this is the function I would like to speed up. The first function is not going to be useful.


Solution

  • Regarding your last edit, I'd make a few changes (mainly using fuzzywuzzy.process rather than fuzzywuzzy.fuzz):

    import difflib
    from fuzzywuzzy import fuzz, process

    def cluster_resolution(df):
        clusters = {}
        for string in df['resolution'].unique():
            match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
            if match1:
                for m in match1:
                    clusters[m].append(string)
            else:
                bests = process.extractBests(
                        string,
                        set(clusters.keys()) - {string},
                        scorer=fuzz.partial_ratio,
                        score_cutoff=80,
                        limit=1
                        )

                if bests:
                    clusters[bests[0][0]].append(string)
                else:
                    clusters[string] = [ string ]
        return clusters
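
    If you then want the per-string mapping you were originally building (each resolution mapped to one representative string), one way to flatten the returned dictionary could look like this (a small sketch on top of the function above; resolution_mapped is just a column name I made up):

    clusters = cluster_resolution(df)

    # flatten the clusters dict into a string -> representative mapping
    # (if a string landed in several clusters, the last one simply wins)
    mapping = {member: representative
               for representative, members in clusters.items()
               for member in members}

    df['resolution_mapped'] = df['resolution'].map(mapping)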
    

    But I think you could look into other solutions, like CountVectorizer and whatever metric suits it. It is a way to gain speed (as it is vectorized), though the results may be imperfect. Note that CountVectorizer could be a good fit for you, since you already made the choice of partial_ratio.

    For example, something like this:

    from sklearn.feature_extraction.text import CountVectorizer
    from scipy.spatial.distance import pdist, squareform
    import pandas as pd
    import hdbscan
    
    df = pd.DataFrame(d)
    
    cv = CountVectorizer(stop_words="english")
    transformed = cv.fit_transform(df['resolution'])
    transformed = pd.DataFrame(
            transformed.toarray(), 
            columns=cv.get_feature_names(),
            index=df['resolution'])
    
    # keep only the words that appear more than twice in the corpus
    transformed = transformed[transformed.columns[transformed.sum() > 2]]
    
    # compute the pairwise distance matrix
    dist = pdist(transformed, metric="hamming") * transformed.shape[1]
    s = squareform(dist)
    
    clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
    clusterer.fit_predict(s)
    
    df['labels'] = clusterer.labels_
    
    print(df.sort_values('labels'))
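
    HDBSCAN labels the rows it cannot cluster as -1, so once you have the labels you could, for instance, pick one representative per cluster and rebuild the kind of mapping you are after. A rough sketch (choosing the shortest string as representative is just my own heuristic, not something hdbscan provides):

    # rows labelled -1 are "noise" for HDBSCAN and keep their own text
    reps = (df[df['labels'] != -1]
            .groupby('labels')['resolution']
            .apply(lambda s: min(s, key=len)))   # shortest member as representative
    mapping = {row['resolution']: reps.get(row['labels'], row['resolution'])
               for _, row in df.iterrows()}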
    

    I think this can still be improved (this is my first shot at text clustering...). You could also pass your own list of stop words to CountVectorizer, which would be a way to help the algorithm (a short sketch of that follows the next snippet). At the very least, it could help you pre-cluster your dataset before using your previous function, for instance this way:

    df.groupby('labels').apply(cluster_resolution)  # pass each label's sub-dataframe to the fuzzy clustering
    

    (That way, if your first clustering is roughly OK, you would only check each value against the other values in its cluster, not against all values.)
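
    Regarding the stop-word list mentioned above, CountVectorizer also accepts a plain Python list, so one option could be to extend scikit-learn's built-in English list with a few domain words of your own (the extra words below are made-up examples, not a recommendation):

    from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

    # extend the built-in English stop words with domain words (illustrative choices only)
    custom_stop_words = list(ENGLISH_STOP_WORDS) + ['user', 'station', 'working']
    cv = CountVectorizer(stop_words=custom_stop_words)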

    Credit to @anon01 for the computation of the distance matrix in this answer, which seems to give slightly better results than hdbscan's default.

    Edit:

    Another try, including:

    • a change of metric,
    • an added TF-IDF step,
    • and an added lemmatization step using the nltk package.

    So this would be:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from scipy.spatial.distance import pdist, squareform
    import pandas as pd
    import hdbscan
    import nltk
    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import wordnet

    # nltk needs its data downloaded once, e.g. nltk.download('punkt'),
    # nltk.download('averaged_perceptron_tagger') and nltk.download('wordnet')
    
    d = {...}
    df = pd.DataFrame(d)
    
    lemmatizer = WordNetLemmatizer()
    
    def lemmatization(sentence):
        
        tag_dict = {
                    "J": wordnet.ADJ,
                    "N": wordnet.NOUN,
                    "V": wordnet.VERB,
                    "R": wordnet.ADV,
                    }
    
        # Tokenize the sentence
        wordsList = nltk.word_tokenize(sentence) 
        
        # POS-tag the tokens
        tagged = nltk.pos_tag(wordsList)
        
        # Convert the list of (token, tag) to lemmatized tokens
        lems = [
                lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN) )
                for token, tag
                in tagged
                ]
    
        lems = ' '.join(lems)
        return lems
    
    df['lemmatized'] = df['resolution'].apply(lemmatization)
    
    corpus = df['lemmatized']
    pipe = Pipeline(
            [
                    ('cv', CountVectorizer(stop_words="english")),
                    ('tfid', TfidfTransformer())
             ]).fit(corpus)
    
    transformed = pipe.transform(corpus)
    transformed = pd.DataFrame(
            transformed.toarray(), 
            columns=pipe.named_steps['cv'].get_feature_names(),
            index=df['resolution'])
    
    dist = pdist(transformed, metric="cosine") * transformed.shape[1]
    s = squareform(dist)
    
    clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
    clusterer.fit_predict(s)
    
    df['labels'] = clusterer.labels_
    
    print(df.sort_values('labels'))
    

    You could also add some domain-specific code, as your sample seems to be about very specific maintenance logs.

    For instance, you could add new features to the transformed dataframe based on a small list of hardware/software terms:

    import numpy as np

    # To create a feature about the OS:
    cols = ['os', 'linux', 'window']
    transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

    # To create a feature about hardware:
    cols = ["laptop", "printer", "scanner"]
    transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))
    

    This step could help you get better results but may not be necessary. I'm not sure how it will compare to FuzzyWuzzy's performance for matching strings, but I would be interested in your feedback!
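
    If you want to quantify that comparison yourself, here is a rough way to time the two approaches side by side on a sample (a sketch only: the dense distance matrix would not fit in memory for the full 789k rows, and the sample size of 2000 is arbitrary):

    import time
    import hdbscan
    from scipy.spatial.distance import pdist, squareform
    from sklearn.feature_extraction.text import CountVectorizer

    sample = df.sample(min(len(df), 2000), random_state=0)

    t0 = time.perf_counter()
    clusters = cluster_resolution(sample)          # fuzzy approach from the top of this answer
    t1 = time.perf_counter()

    X = CountVectorizer(stop_words="english").fit_transform(sample['resolution']).toarray()
    s = squareform(pdist(X, metric="hamming") * X.shape[1])
    hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2).fit_predict(s)
    t2 = time.perf_counter()

    print(f"fuzzy clustering: {t1 - t0:.2f}s | vectorized clustering: {t2 - t1:.2f}s")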