I have the following code:
df.groupby(["id", "year"], as_index=False).agg({"brand":"first", "color":"first"})
However, I have some values that are NaN. I want to select the first value that is not NaN.
Suppose my dataframe looks like this:
id | year | brand | color |
---|---|---|---|
001 | 2010 | NaN | Blue |
001 | 2010 | Audi | NaN |
001 | 2010 | Audo | Blue |
001 | 2011 | Bmw | NaN |
001 | 2011 | NaN | NaN |
001 | 2012 | BMW | Green |
002 | 2010 | Tesla | White |
I want to find all unique combinations of id and year,i.e. df.groupby(["id", "year"])
and then find the first non-NaN value. The motivation behind this is that a have a large and messy data set with many missing values and many typos. In the example table I also simulated a typo. Note that it is irrelevant whether the typo is first and gets chosen, as long as I keep track of the data per combination of id and year. Typos are a completely separate problem for now.
The desired output would be:
id | year | brand | color |
---|---|---|---|
001 | 2010 | Audi | Blue |
001 | 2011 | BMW | NaN |
001 | 2012 | BMW | Green |
002 | 2010 | Tesla | White |
This approach will produce the below output.
df1 = df.groupby(['id', 'year']).agg({
'brand': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
'color': lambda x: x.dropna().iloc[0] if x.dropna().any() else np.nan,
}).reset_index()
print(df1)
Using a costume function:
def get_first_non_nan(x):
return x.dropna().iloc[0] if x.dropna().any() else np.nan
df1 = df.groupby(['id', 'year']).agg({
'brand': get_first_non_nan,
'color': get_first_non_nan,
}).reset_index()
print(df1)
id year brand color
0 001 2010 Audi Blue
1 001 2011 Bmw NaN
2 001 2012 BMW Green
3 002 2010 Tesla White