Search code examples
pythonpandasdata-analysisdata-preprocessing

How do I categorically group responses typed out differently in a survey into common groups to process the data easier?


I am preprocessing data received on a survey through a .csv file. This column contains the course name the students have taken. Since this has been typed out by them, there are different ways the same course name has been spelled out. For e.g.: the course name 'B.A. L.L.B.' has been typed out like 'Ballb' or 'bal.l.b.' etc. I've tried the most basic brute force approach I could think of where I take all the options in an if statement and replace them with the generic course spelling but I'm still getting tonnes of values the program has not been able to group into one of those statements. Is there a faster way to group them together?

def get_course_name(x):
if 'B.E' in x or 'B.E.' in x or 'BE' in x or 'B.E(cse)' in x or 'Bachelor Of Engineering' in x or 'BECSE' in x or 'Be' in x:
    return 'B.E.'
if 'L.L.B.' in x or 'Ballb(h)' in x or 'Ballb' in x:
    return 'B.A. LLB'
if 'B.Tech' in x or 'B.TECH' in x or 'B.tech' in x or 'B .Tech' in x or 'Btech' in x or 'BTech' in x or 'B-tech' in x or 'B.Tech.' in x or 'CSE' in x or 'Biotechnology' in x or 'Biotech' in x:
    return 'B. Tech' 
if 'B.pharmacy' in x or 'B. Pharmacy' in x or 'B pharma' in x or 'pharmacy' in x or 'B.Pharmacy' in x or 'M.pharmacy' in x or 'B.Pharm' in x or 'Pharma' in x or 'pharm' in x or 'Pharmacy' in x or 'B.pharma' in x or 'B-pharmacy' in x:
    return 'B. Pharma'
if 'BBA' in x or 'bba' in x:
    return 'BBA'
if 'MBA' in x or 'mba' in x or 'Mba' in x or 'MBA ' in x:
    return 'MBA'
if 'M.Tech' in x or 'M. Tech' in x or 'mtech' in x or 'm.tech' in x or 'M-tech' in x or 'Mtec-EE' in x:
    return 'M. Tech'
if 'MBBS' in x or 'mbbs' in x:
    return 'MBBS'
if 'B.Sc' in x or 'B. Sc' in x or 'Bsc.' in x or 'B.S.c' in x:
    return 'B. Science'
if 'msc' in x or 'M.Sc' in x or 'M. Sc' in x or 'Msc' in x or 'MSc' in x or 'm.sc' in x:
    return 'M. Science'
return 'misc'

And this is where I call the function to get the value counts of each course:

df1['Course Name'] = df1['Course Name'].apply(get_course_name)
df1['Course Name'].value_counts()

This is what the dataframe looks like

The column I'm trying to group is called 'Course Name'.


Solution

  • Let's try to build a mapper:

    import pandas as pd
    
    src_df = pd.DataFrame({'Course Name': ['B.E.', 'ME CSE', 'English Literature',
                                           'Bsc. Economics Honrs.', 'BSC nursing'],
                           'Course Year': ['Fourth', 'Second', 'First',
                                           'Second', "Fourth"]})
    
    # Define Aliases Here (Desired Format on Left, Options on Right)
    aliases = {
        'B.E.': ['B.E', 'B.E.', 'BE', 'B.E(cse)',
                 'Bachelor Of Engineering', 'BECSE', 'Be'],
        'B. Science': ['B.Sc', 'B. Sc', 'Bsc.', 'B.S.c']
    }
    
    # Generate Mapper from aliases
    mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}
    
    # Apply Mapper to every Course Name
    src_df['Course Name'] = src_df['Course Name'] \
        .apply(lambda x: pd.Series(map(mapper.get,
                                       filter(lambda v: v in x, mapper)))) \
        .fillna('misc')
    
    # For Display
    print(src_df.to_string())
    

    Output:

      Course Name Course Year
    0        B.E.      Fourth
    1        misc      Second
    2        misc       First
    3  B. Science      Second
    4        misc      Fourth
    

    mapper = {alias: new_code for new_code, lst in aliases.items() for alias in lst}
    

    Builds a dictionary based on the defined aliases above. The aliases list is just there since it is more readable then the mapper.

    Mapper:

    {'B.E': 'B.E.', 'B.E.': 'B.E.',
     'BE': 'B.E.', 'B.E(cse)': 'B.E.',
     'Bachelor Of Engineering': 'B.E.',
     'BECSE': 'B.E.', 'Be': 'B.E.',
     'B.Sc': 'B. Science',
     'B. Sc': 'B. Science', 'Bsc.': 'B. Science',
     'B.S.c': 'B. Science'}
    

    *Note mapper assumes all aliases are unique and that no one alias is used for multiple cases.


    After this, test every Course name to see if the string is contained in the dictionary

    (Ref. Python: Check if a key in a dictionary is contained in a string):

    src_df['Course Name'] = src_df['Course Name'] \
        .apply(lambda x: pd.Series(map(mapper.get,
                                       filter(lambda v: v in x, mapper))))
    print(src_df)
    

    src_df:

      Course Name Course Year
    0        B.E.      Fourth
    1         NaN      Second
    2         NaN       First
    3  B. Science      Second
    4         NaN      Fourth
    

    Then go back through and fill in default case:

    .fillna('misc')
    

    Which will replace the un-mapped rows with the default value 'misc'