I'm taking on my very first machine learning problem, and I'm struggling with cleaning the categorical features in my dataset. My goal is to build a rock climbing recommendation system.
PROBLEM 1:
I have three related columns that contain erroneous information:
A minimal example of the current data, and what I want it to look like, is given below.
If you group by the location name, there are different location_id numbers and countries associated with that one name. However, there is a clear majority for each of these discrepancies. I have a dataset of 2 million entries, and the mode of location_id and location_country given the location_name overwhelmingly points to one answer (example: 300 and "USA" for clear_creek).
Using pandas/Python, how do I group my dataset by location_name, compute the mode of location_id and location_country for each name, and then replace the entire id and country columns with those modes?
I've played around with groupby, replace, and duplicated, but I think I will ultimately need to write a function, and I honestly have no idea where to start. (I apologize in advance for my coding naivety.) I know there's got to be a solution; I just need to be pointed in the right direction.
PROBLEM 2:
Also, does anyone have suggestions for filling in the NaN values in my location_name (42,012 of 2 million) and location_country (46,890 of 2 million) columns? Is it best to keep them as an "unknown" value? I feel like filling in these features based on frequency would introduce a horrible bias into my dataset.
import pandas as pd

data = {'index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
        'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                          'clear_creek', 'clear_creek', 'clear_creek',
                          'clear_creek', 'clear_creek'],
        'location_id': [100, 100, 0, 100, 300, 625, 300, 300, 300],
        'location_country': ['GRC', 'GRC', 'ESP', 'GRC', 'USA', 'IRE',
                             'USA', 'USA', 'USA']}
df = pd.DataFrame.from_dict(data)
Looking for it to return:
improved_data = {'index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
                 'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                                   'clear_creek', 'clear_creek', 'clear_creek',
                                   'clear_creek', 'clear_creek'],
                 'location_id': [100, 100, 100, 100, 300, 300, 300, 300, 300],
                 'location_country': ['GRC', 'GRC', 'GRC', 'GRC', 'USA', 'USA',
                                      'USA', 'USA', 'USA']}
new_df = pd.DataFrame.from_dict(improved_data)
You can use transform, taking the first mode of each group with .iat[0]. Select only the two columns that need fixing, so the index column is left untouched:

df[['location_id', 'location_country']] = (
    df.groupby('location_name')[['location_id', 'location_country']]
      .transform(lambda x: x.mode().iat[0]))
print(df)

   index location_name  location_id location_country
0      1     kalaymous          100              GRC
1      2     kalaymous          100              GRC
2      3     kalaymous          100              GRC
3      4     kalaymous          100              GRC
4      5   clear_creek          300              USA
5      6   clear_creek          300              USA
6      7   clear_creek          300              USA
7      8   clear_creek          300              USA
8      9   clear_creek          300              USA
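With 2 million rows, calling mode inside transform once per row group can get slow, since the lambda runs per group per column. An alternative sketch (same idea, assuming plain pandas) computes the per-name modes once with agg and then maps them back onto the full frame:

```python
import pandas as pd

data = {'index': [1, 2, 3, 4, 5, 6, 7, 8, 9],
        'location_name': ['kalaymous', 'kalaymous', 'kalaymous', 'kalaymous',
                          'clear_creek', 'clear_creek', 'clear_creek',
                          'clear_creek', 'clear_creek'],
        'location_id': [100, 100, 0, 100, 300, 625, 300, 300, 300],
        'location_country': ['GRC', 'GRC', 'ESP', 'GRC', 'USA', 'IRE',
                             'USA', 'USA', 'USA']}
df = pd.DataFrame(data)

# Compute the most frequent id/country once per location_name.
# Series.mode() returns the modes sorted, so .iat[0] picks the first.
modes = (df.groupby('location_name')[['location_id', 'location_country']]
           .agg(lambda x: x.mode().iat[0]))

# Map each row's location_name to the precomputed group modes.
df['location_id'] = df['location_name'].map(modes['location_id'])
df['location_country'] = df['location_name'].map(modes['location_country'])
```

This does one pass over the groups instead of one mode computation per (group, column) inside transform, which tends to matter at the 2-million-row scale described in the question.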
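For Problem 2: with only about 2% of rows missing, keeping the NaNs as an explicit "unknown" category is a reasonable default, and your instinct is right that imputing the most frequent location would bias the data toward the biggest climbing areas. A minimal sketch (the "unknown" label is just an illustrative choice):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'location_name': ['kalaymous', np.nan, 'clear_creek'],
                   'location_country': ['GRC', 'USA', np.nan]})

# Keep missing values as their own explicit category instead of
# imputing by frequency, which would over-represent popular locations.
df[['location_name', 'location_country']] = (
    df[['location_name', 'location_country']].fillna('unknown'))
```

After this, "unknown" behaves like any other category during encoding, and the model can learn whatever signal (if any) missingness carries.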