Search code examples
pythonpandasdataframedifflib

pandas get_close_matches - returning empty value


I am working on a requirement, there are 2 CSV as below -

CSV1.csv

   Short Description                                                    Category
    Device is DOWN!                                                      Server Down
    CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization
    CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization
    CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization
    CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization
    Device Performance Alerts was triggered on Physical memory           Memory Utilization
    Device Performance Alerts was triggered on Physical memory           Memory Utilization
    Device Performance Alerts was triggered on Physical memory           Memory Utilization
    Disk Space Is Lowon ;E:                                              Disk Space Utilization
    Disk Space Is Lowon;C:                                               Disk Space Utilization
    Network Interface Down                                               Interface Down
    Active Directory                                                     

and reference.csv

Category                         Complexity
Server Down                      Simple
Network Interface down           Complex
Drive Cleanup Windows            Medium
CPU Utilization                  Medium
Memory Utilization               Medium
Disk Space Utilization Unix      Simple
Windows Service Restart          Medium
UNIX Service Restart             Medium
Web Tomcat Instance Restart      Simple

Expected Output

Short Description                                                    Category                    Complexity
Device is DOWN!                                                      Server Down                 Simple
CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization             Medium
CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization             Medium
CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization             Medium
CPU Warning Monitoron  XSSXSXSXSXSX.com                              CPU Utilization             Medium
Device Performance Alerts was triggered on Physical memory           Memory Utilization          Medium
Device Performance Alerts was triggered on Physical memory           Memory Utilization          Medium
Device Performance Alerts was triggered on Physical memory           Memory Utilization          Medium
Disk Space Is Lowon ;E:                                              Disk Space Utilization      Medium
Disk Space Is Lowon;C:                                               Disk Space Utilization      Medium
Network Interface Down                                               Interface Down              Complex

I tried below code - But in the output dataframe I can see blank [] not sure what I am missing. In output complexity column i can see just [] for each row. I have tried to get exact match but i need to get all the possible combinations so I am using get_close_matches. How to pass possibility argument which is in dataframe in below code, I am not figure out the way to pass the possibility.

I have tried few other possibilities like exact but that has not given result as expected as I am looking for the all possible combinations while comparing the columns with the string

import pandas as pd
import difflib
df1 = pd.read_csv('csv1.csv')
df1 = df1.fillna('')
df2 = pd.read_csv('reference.csv')
my_dict = dict(zip(df2['Category'].values, df2['Complexity'].values))
def match_key(key, default_value):
    if not key:
        return default_value
    for d_key in my_dict.keys():
        if key in d_key or d_key in key:
            return my_dict[d_key]

    return default_value
df1['Complexity'] = df1['Category'].apply(lambda x: difflib.get_close_matches(x, list(my_dict.keys(), n=1)))
df1 = df1.explode('Complexity')
df1['Complexity'] = df1['Complexity'].map(my_dict)
print(df1)

Solution

  • difflib.get_close_matches expects the first argument to be the 'word', x in your case, and the second argument to be the 'possibilities'. Which you've provided as an empty string. This is why your function isn't working, it's trying to match a word with nothing basically.

    my_dict contains the valid options as keys, so we can use them as the list of 'possibilities'

    # Use n=1, so only tries to get 1 match
    df1['Complexity'] = df1['Category'].apply(lambda x: difflib.get_close_matches(x, list(my_dict.keys()), n=1))
    # The output of get_close_matches is a list, we use explode to convert it to a string
    df1 = df1.explode('Complexity')
    # We can now apply our map, to the *Complexity* column, 
    # which is technically the best match *Category*, via get_close_matches
    df1['Complexity'] = df1['Complexity'].map(my_dict)
    

    Original Bad Answer

    However, rather than continue using the difflib, I think you can change your approach. You want to apply my_dict to the Category column of df1. This is traditionally known as applying a map. pandas has this implementation ready via pandas.Series.map.

    df1['Complexity'] = df1['Category'].map(my_dict)