I am preprocessing data received on a survey through a .csv file. This column contains the course name the students have taken. Since this has been typed out by them, there are different ways the same course name has been spelled out. For e.g.: the course name 'B.A. L.L.B.' has been typed out like 'Ballb' or 'bal.l.b.' etc. I've tried the most basic brute force approach I could think of where I take all the options in an if statement and replace them with the generic course spelling but I'm still getting tonnes of values the program has not been able to group into one of those statements. Is there a faster way to group them together?
def get_course_name(x):
if 'B.E' in x or 'B.E.' in x or 'BE' in x or 'B.E(cse)' in x or 'Bachelor Of Engineering' in x or 'BECSE' in x or 'Be' in x:
return 'B.E.'
if 'L.L.B.' in x or 'Ballb(h)' in x or 'Ballb' in x:
return 'B.A. LLB'
if 'B.Tech' in x or 'B.TECH' in x or 'B.tech' in x or 'B .Tech' in x or 'Btech' in x or 'BTech' in x or 'B-tech' in x or 'B.Tech.' in x or 'CSE' in x or 'Biotechnology' in x or 'Biotech' in x:
return 'B. Tech'
if 'B.pharmacy' in x or 'B. Pharmacy' in x or 'B pharma' in x or 'pharmacy' in x or 'B.Pharmacy' in x or 'M.pharmacy' in x or 'B.Pharm' in x or 'Pharma' in x or 'pharm' in x or 'Pharmacy' in x or 'B.pharma' in x or 'B-pharmacy' in x:
return 'B. Pharma'
if 'BBA' in x or 'bba' in x:
return 'BBA'
if 'MBA' in x or 'mba' in x or 'Mba' in x or 'MBA ' in x:
return 'MBA'
if 'M.Tech' in x or 'M. Tech' in x or 'mtech' in x or 'm.tech' in x or 'M-tech' in x or 'Mtec-EE' in x:
return 'M. Tech'
if 'MBBS' in x or 'mbbs' in x:
return 'MBBS'
if 'B.Sc' in x or 'B. Sc' in x or 'Bsc.' in x or 'B.S.c' in x:
return 'B. Science'
if 'msc' in x or 'M.Sc' in x or 'M. Sc' in x or 'Msc' in x or 'MSc' in x or 'm.sc' in x:
return 'M. Science'
return 'misc'
And this is where I call the function to get the value counts of each course:
df1['Course Name'] = df1['Course Name'].apply(get_course_name)
df1['Course Name'].value_counts()
This is what the dataframe looks like
The column I'm trying to group is called 'Course Name'.
Let's try to build a mapper:
import pandas as pd
src_df = pd.DataFrame({'Course Name': ['B.E.', 'ME CSE', 'English Literature',
'Bsc. Economics Honrs.', 'BSC nursing'],
'Course Year': ['Fourth', 'Second', 'First',
'Second', "Fourth"]})
# Define Aliases Here (Desired Format on Left, Options on Right)
aliases = {
'B.E.': ['B.E', 'B.E.', 'BE', 'B.E(cse)',
'Bachelor Of Engineering', 'BECSE', 'Be'],
'B. Science': ['B.Sc', 'B. Sc', 'Bsc.', 'B.S.c']
}
# Generate Mapper from aliases
mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}
# Apply Mapper to every Course Name
src_df['Course Name'] = src_df['Course Name'] \
.apply(lambda x: pd.Series(map(mapper.get,
filter(lambda v: v in x, mapper)))) \
.fillna('misc')
# For Display
print(src_df.to_string())
Output:
Course Name Course Year 0 B.E. Fourth 1 misc Second 2 misc First 3 B. Science Second 4 misc Fourth
mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}
Builds a dictionary based on the defined aliases
above. The aliases
list is just there since it is more readable then the mapper.
Mapper:
{'B.E': 'B.E.', 'B.E.': 'B.E.',
'BE': 'B.E.', 'B.E(cse)': 'B.E.',
'Bachelor Of Engineering': 'B.E.',
'BECSE': 'B.E.', 'Be': 'B.E.',
'B.Sc': 'B. Science',
'B. Sc': 'B. Science', 'Bsc.': 'B. Science',
'B.S.c': 'B. Science'}
*Note mapper assumes all aliases are unique and that no one alias is used for multiple cases.
After this, test every Course name to see if the string is contained in the dictionary
(Ref. Python: Check if a key in a dictionary is contained in a string):
src_df['Course Name'] = src_df['Course Name'] \
.apply(lambda x: pd.Series(map(mapper.get,
filter(lambda v: v in x, mapper))))
print(src_df)
src_df:
Course Name Course Year 0 B.E. Fourth 1 NaN Second 2 NaN First 3 B. Science Second 4 NaN Fourth
Then go back through and fill in default case:
.fillna('misc')
Which will replace the un-mapped rows with the default value 'misc'