I am working on a requirement, there are 2 CSV as below -
CSV1.csv
Short Description Category
Device is DOWN! Server Down
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization
Device Performance Alerts was triggered on Physical memory Memory Utilization
Device Performance Alerts was triggered on Physical memory Memory Utilization
Device Performance Alerts was triggered on Physical memory Memory Utilization
Disk Space Is Lowon ;E: Disk Space Utilization
Disk Space Is Lowon;C: Disk Space Utilization
Network Interface Down Interface Down
Active Directory
and reference.csv
Category Complexity
Server Down Simple
Network Interface down Complex
Drive Cleanup Windows Medium
CPU Utilization Medium
Memory Utilization Medium
Disk Space Utilization Unix Simple
Windows Service Restart Medium
UNIX Service Restart Medium
Web Tomcat Instance Restart Simple
Expected Output
Short Description Category Complexity
Device is DOWN! Server Down Simple
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization Medium
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization Medium
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization Medium
CPU Warning Monitoron XSSXSXSXSXSX.com CPU Utilization Medium
Device Performance Alerts was triggered on Physical memory Memory Utilization Medium
Device Performance Alerts was triggered on Physical memory Memory Utilization Medium
Device Performance Alerts was triggered on Physical memory Memory Utilization Medium
Disk Space Is Lowon ;E: Disk Space Utilization Medium
Disk Space Is Lowon;C: Disk Space Utilization Medium
Network Interface Down Interface Down Complex
I tried below code - But in the output dataframe I can see blank []
not sure what I am missing. In output complexity column i can see just [] for each row. I have tried to get exact match but i need to get all the possible combinations so I am using get_close_matches. How to pass possibility argument which is in dataframe in below code, I am not figure out the way to pass the possibility.
I have tried few other possibilities like exact but that has not given result as expected as I am looking for the all possible combinations while comparing the columns with the string
import pandas as pd
import difflib
df1 = pd.read_csv('csv1.csv')
df1 = df1.fillna('')
df2 = pd.read_csv('reference.csv')
my_dict = dict(zip(df2['Category'].values, df2['Complexity'].values))
def match_key(key, default_value):
if not key:
return default_value
for d_key in my_dict.keys():
if key in d_key or d_key in key:
return my_dict[d_key]
return default_value
df1['Complexity'] = df1['Category'].apply(lambda x: difflib.get_close_matches(x, list(my_dict.keys(), n=1)))
df1 = df1.explode('Complexity')
df1['Complexity'] = df1['Complexity'].map(my_dict)
print(df1)
difflib.get_close_matches
expects the first argument to be the 'word', x
in your case, and the second argument to be the 'possibilities'. Which you've provided as an empty string. This is why your function isn't working, it's trying to match a word with nothing basically.
my_dict
contains the valid options as keys, so we can use them as the list of 'possibilities'
# Use n=1, so only tries to get 1 match
df1['Complexity'] = df1['Category'].apply(lambda x: difflib.get_close_matches(x, list(my_dict.keys()), n=1))
# The output of get_close_matches is a list, we use explode to convert it to a string
df1 = df1.explode('Complexity')
# We can now apply our map, to the *Complexity* column,
# which is technically the best match *Category*, via get_close_matches
df1['Complexity'] = df1['Complexity'].map(my_dict)
Original Bad Answer
However, rather than continue using the difflib
, I think you can change your approach. You want to apply my_dict
to the Category
column of df1
. This is traditionally known as applying a map
. pandas
has this implementation ready via pandas.Series.map
.
df1['Complexity'] = df1['Category'].map(my_dict)