I'm trying to break a DataFrame into four parts and impute rounded mean values for each part using fillna(). I have two columns I want to filter on, main_campus and degree_type, which have two unique values each. So between them I should be able to filter the DataFrame into four groups.
I first did this with a triple for loop (see below), which seems to work, but when I tried to do it in a more elegant way, I got a SettingWithCopy warning that I couldn't fix by using .loc or .copy(), and the missing values weren't filled even when inplace was set to True. Here's the code for the latter method:
# Imputing mean values for main campus BA students
df[(df.main_campus == 1) &
   (df.degree_type == 'BA')] = df[(df.main_campus == 1) &
                                  (df.degree_type == 'BA')].fillna(
    df[(df.main_campus == 1) &
       (df.degree_type == 'BA')
       ].mean(),
    inplace=True)
# Imputing mean values for main campus BS students
df[(df.main_campus == 1) &
   (df.degree_type == 'BS')] = df[(df.main_campus == 1) &
                                  (df.degree_type == 'BS')].fillna(
    df[(df.main_campus == 1) &
       (df.degree_type == 'BS')
       ].mean(),
    inplace=True)
# Imputing mean values for downtown campus BA students
df[(df.main_campus == 0) &
   (df.degree_type == 'BA')] = df[(df.main_campus == 0) &
                                  (df.degree_type == 'BA')].fillna(
    df[(df.main_campus == 0) &
       (df.degree_type == 'BA')
       ].mean(),
    inplace=True)
# Imputing mean values for downtown campus BS students
df[(df.main_campus == 0) &
   (df.degree_type == 'BS')] = df[(df.main_campus == 0) &
                                  (df.degree_type == 'BS')].fillna(
    df[(df.main_campus == 0) &
       (df.degree_type == 'BS')
       ].mean(),
    inplace=True)
I should mention the previous code went through several iterations: I tried it without assigning back to the slice, with and without inplace, etc.
Here's the code with the triple for loop that works:
imputation_cols = [...]  # all the columns I want to impute
for col in imputation_cols:
    for i in [1, 0]:
        for path in ['BA', 'BS']:
            group = df.loc[((df.main_campus == i) &
                            (df.degree_type == path)), :]
            group = group.fillna(value=round(group.mean()))
            df.loc[((df.main_campus == i) &
                    (df.degree_type == path)), :] = group
It's worth mentioning that I think the group variable in the triple for loop code also helps the filled NaN values actually get set back to the DataFrame, but I would need to double-check this.
Does anyone have an idea for what's going on here?
A good way to approach such a problem is to simplify your code, which makes it easier to find the source of the warning:
group1 = (df.main_campus == 1) & (df.degree_type == 'BA')
group2 = (df.main_campus == 1) & (df.degree_type == 'BS')
group3 = (df.main_campus == 0) & (df.degree_type == 'BA')
group4 = (df.main_campus == 0) & (df.degree_type == 'BS')
# Imputing mean values for main campus BA students (repeat for the other groups)
df.loc[group1, :] = df.loc[group1, :].fillna(df.loc[group1, :].mean())
Now you can see the problem more clearly. You are computing the mean from a slice of df and writing the result back into that same slice in a single statement. Pandas issues a warning because the slice you use to compute the mean could be inconsistent with the changed DataFrame. In your case it produces the correct result, but the consistency of your DataFrame is at risk.
You could solve this by computing the mean beforehand:
group1_mean = df.loc[group1, :].mean()
df.loc[group1, :] = df.loc[group1, :].fillna(group1_mean)
In my opinion this makes the code clearer. But you still have four groups (group1, group2, ...), which is a clear sign to use a loop:
from itertools import product
for campus, degree in product([1, 0], ['BS', 'BA']):
group = (df.main_campus == campus) & (df.degree_type == degree)
group_mean = df.loc[group, :].mean()
df.loc[group, :] = df.loc[group, :].fillna(group_mean)
I have used product from itertools to get rid of the ugly nested loop. It is quite similar to your "inelegant" first solution, so you were almost there the first time.
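For reference, here is a minimal runnable sketch of that loop. The sample DataFrame and the columns gpa and credits are made up for illustration, since the question did not include data. It also makes two assumptions beyond the snippet above: the mean is taken only over the columns to impute (recent pandas versions raise an error when mean() is called on a string column such as degree_type), and the mean is rounded, as the question asks.

```python
from itertools import product

import numpy as np
import pandas as pd

# Hypothetical sample data; gpa and credits are invented column names.
df = pd.DataFrame({
    "main_campus": [1, 1, 1, 1, 1, 0, 0, 0, 0],
    "degree_type": ["BA", "BA", "BA", "BS", "BS", "BA", "BA", "BS", "BS"],
    "gpa": [3.0, np.nan, 3.5, 2.0, np.nan, 4.0, np.nan, 1.0, np.nan],
    "credits": [10.0, 12.0, 14.0, np.nan, 20.0, np.nan, 30.0, 41.0, np.nan],
})
imputation_cols = ["gpa", "credits"]  # assumed list of columns to impute

for campus, degree in product([1, 0], ["BS", "BA"]):
    # Boolean mask selecting the rows of one campus/degree combination
    group = (df.main_campus == campus) & (df.degree_type == degree)
    # Compute the rounded per-column mean first, restricted to numeric columns
    group_mean = df.loc[group, imputation_cols].mean().round()
    # Fill the gaps in this group only and write the result back via .loc
    df.loc[group, imputation_cols] = df.loc[group, imputation_cols].fillna(group_mean)

print(df)
```

Restricting both sides of the assignment to imputation_cols keeps the filter columns untouched and avoids averaging non-numeric data.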
We ended up with four lines of code and a loop. I am sure with some pandas magic you could convert it to one line. However, you will still understand these four lines in a week or a month or a year from now. Also, other people reading your code will understand it easily. Readability counts.
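One possible version of that "pandas magic", offered as a sketch rather than a tested drop-in for your data, uses groupby with transform to compute each row's group mean in one step. The sample DataFrame and the columns gpa and credits are again hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; gpa and credits are invented column names.
df = pd.DataFrame({
    "main_campus": [1, 1, 1, 1, 1, 0, 0, 0, 0],
    "degree_type": ["BA", "BA", "BA", "BS", "BS", "BA", "BA", "BS", "BS"],
    "gpa": [3.0, np.nan, 3.5, 2.0, np.nan, 4.0, np.nan, 1.0, np.nan],
    "credits": [10.0, 12.0, 14.0, np.nan, 20.0, np.nan, 30.0, 41.0, np.nan],
})
imputation_cols = ["gpa", "credits"]  # assumed list of columns to impute

# transform("mean") returns a frame with the same index as df, where every
# row holds the mean of its own (main_campus, degree_type) group.
group_means = (
    df.groupby(["main_campus", "degree_type"])[imputation_cols]
    .transform("mean")
    .round()
)

# fillna aligns on index and columns, so each NaN gets its own group's mean.
df[imputation_cols] = df[imputation_cols].fillna(group_means)

print(df)
```

This drops the explicit loop and the list of campus/degree values, at the cost of being a little less obvious to someone who has not seen transform before.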
Disclaimer: I could not test the code since you did not provide a sample dataframe. So my code may throw an error because of a typo. A minimal reproducible example makes it so much easier to answer questions. Please consider this the next time you post a question on SO.