Tags: python, python-3.x, pandas, pandas-groupby, sklearn-pandas

Cleaning up & filling in categorical variables for Data Science analysis


I'm taking on my very first machine learning problem, and I'm struggling with cleaning my categorical features in my dataset. My goal is to build a rock climbing recommendation system.

PROBLEM 1:

I have three related columns that contain erroneous information:

What the data looks like now, and what I want it to look like, are shown in the sample data at the end of this question.

If you group by the location name, there are different location_id numbers and countries associated with that one name. However, there is a clear majority for each of these discrepancies. I have a dataset of 2 million entries, and the mode of location_id and location_country given the location_name overwhelmingly points to one answer (example: 300 and "USA" for clear_creek).

Using pandas/Python, how do I group my dataset by location_name, compute the mode of location_id and location_country for each name, and then replace the values in the id and country columns with those modes to clean up my data?

I've played around with groupby, replace, and duplicated, but I think I will ultimately need to write a function to do this, and I honestly have no idea where to start. (I apologize in advance for my coding naivety.) I know there has to be a solution; I just need to be pointed in the right direction.

PROBLEM 2:

Also, does anyone have suggestions for filling in the NaN values in my location_name (42,012 of 2 million) and location_country (46,890 of 2 million) columns? Is it best to keep them as an 'unknown' value? I feel like filling in these features based on frequency would introduce a horrible bias into my dataset.
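
For reference, the "keep as an unknown value" option mentioned above can be spelled as a plain fillna with a sentinel label. The sketch below uses a couple of made-up rows purely to illustrate; it is not a claim about which imputation strategy is best:

import pandas as pd
import numpy as np

# Made-up rows with missing categorical values (illustration only).
sample = pd.DataFrame({'location_name': ['kalaymous', np.nan, 'clear_creek'],
                       'location_country': ['GRC', 'USA', np.nan]})

# Keep missing entries as an explicit 'unknown' category instead of
# imputing them from the most frequent value.
sample[['location_name', 'location_country']] = (
    sample[['location_name', 'location_country']].fillna('unknown'))
print(sample)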

Sample data:

import pandas as pd

data = {'index': [1,2,3,4,5,6,7,8,9], 
        'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                          'clear_creek', 'clear_creek', 'clear_creek', 
                          'clear_creek', 'clear_creek'],
        'location_id': [100,100,0,100,300,625,300,300,300], 
        'location_country': ['GRC', 'GRC', 'ESP', 'GRC', 'USA', 'IRE', 
                             'USA', 'USA', 'USA']}
df = pd.DataFrame.from_dict(data)

I'm looking for it to return:

improved_data = {'index': [1,2,3,4,5,6,7,8,9], 
            'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                              'clear_creek', 'clear_creek', 'clear_creek', 
                              'clear_creek', 'clear_creek'],
            'location_id': [100,100,100,100,300,300,300,300,300], 
            'location_country': ['GRC', 'GRC', 'GRC', 'GRC', 'USA', 'USA', 
                                 'USA', 'USA', 'USA']}

new_df = pd.DataFrame.from_dict(improved_data)

Solution

  • You can use groupby with transform, taking the mode of each group with .iat[0]:

    # For each location_name group, replace the other columns with that group's
    # mode (the first mode if there is a tie), then restore the original column order.
    df = (df[['location_name']]
          .join(df.groupby('location_name').transform(lambda x: x.mode().iat[0]))
          .reindex(df.columns, axis=1))
    print(df)
    

      index location_name location_id location_country
    0     1     kalaymous         100              GRC
    1     1     kalaymous         100              GRC
    2     1     kalaymous         100              GRC
    3     1     kalaymous         100              GRC
    4     5   clear_creek         300              USA
    5     5   clear_creek         300              USA
    6     5   clear_creek         300              USA
    7     5   clear_creek         300              USA
    8     5   clear_creek         300              USA
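
  • Note that the transform runs over every non-grouping column, so the index column in the printed output has also been collapsed to its per-group mode (1 and 5). If you want to leave index untouched, one variation (a sketch assuming df is the original DataFrame from the question) is to select only the two columns that need repairing:

    # Recompute only location_id and location_country, leaving 'index' as-is.
    cols = ['location_id', 'location_country']
    df[cols] = df.groupby('location_name')[cols].transform(lambda x: x.mode().iat[0])
    print(df)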