Search code examples
pythonpandasdataframegroup-bymissing-data

Fill In Missing Values With Groupby


I'm trying to fill in missing values after groupby two columns on the planets dataset.

# Load data
df = sns.load_dataset('planets')

# Check naan
df.isna().sum()

method              0
number              0
orbital_period     43
mass              522
distance          227
year                0
dtype: int64

However, after filling in the missing values with group mean, missing values still remain. I'm not sure why this is happening (I tried this on the titanic dataset and it completely works there)). Even if I try to fill by each column (no for loop), the problem still shows up.

# Select naan column names
null_cols = df.columns[df.isnull().any()]

# Fill in with loop
for col in null_cols:
  df[col] = df.groupby(['method', 'year'])[col].transform(lambda x: x.fillna(x.mean()))

# Check naan values again
df.isna().sum()

method              0
number              0
orbital_period     28
mass              405
distance           26
year                0

What's wrong here? Any suggestions would be appreciated. Thanks!


Solution

  • The reason this is happening is that all of the values for some of the groups you're generating do not have a single non-nan value.

    Take the value/col mass for the group ('Microlensing', 2012) it has 6 entries of which there are 0 non-nan values. If there are no actual values to take the mean of you can't really calculate a mean which can be used for imputing the other nan-values in the same group.

    Here is the debug code I used:

    import math
    import seaborn as sns
    
    df = sns.load_dataset("planets")
    
    print(df.isna().sum())
    
    null_cols = df.columns[df.isnull().any()]
    
    
    def inspect_fillna(x):
        mean_x = x.mean()
        if math.isnan(mean_x):
            print("group=", x.name, ", entries=", len(x), ", all_are_nan=", len(x) == x.isna().sum(), sep="")
        imputed_x = x.fillna(mean_x)
        return imputed_x
    
    
    for col in null_cols:
        print("\n\ncol=", col, sep="")
        df[col] = df.groupby(["method", "year"])[col].transform(lambda x: inspect_fillna(x))
    
    print(df.isna().sum())
    

    Here is the output:

    method              0
    number              0
    orbital_period     43
    mass              522
    distance          227
    year                0
    dtype: int64
    
    
    col=orbital_period
    group=('Imaging', 2004), entries=3, all_are_nan=True
    group=('Imaging', 2005), entries=1, all_are_nan=True
    group=('Imaging', 2007), entries=1, all_are_nan=True
    group=('Imaging', 2012), entries=2, all_are_nan=True
    group=('Imaging', 2013), entries=7, all_are_nan=True
    group=('Microlensing', 2004), entries=1, all_are_nan=True
    group=('Microlensing', 2009), entries=2, all_are_nan=True
    group=('Microlensing', 2012), entries=6, all_are_nan=True
    group=('Microlensing', 2013), entries=4, all_are_nan=True
    group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
    
    
    col=mass
    group=('Astrometry', 2010), entries=1, all_are_nan=True
    group=('Astrometry', 2013), entries=1, all_are_nan=True
    group=('Eclipse Timing Variations', 2008), entries=2, all_are_nan=True
    group=('Eclipse Timing Variations', 2010), entries=2, all_are_nan=True
    group=('Eclipse Timing Variations', 2011), entries=3, all_are_nan=True
    group=('Imaging', 2004), entries=3, all_are_nan=True
    group=('Imaging', 2005), entries=1, all_are_nan=True
    group=('Imaging', 2006), entries=4, all_are_nan=True
    group=('Imaging', 2007), entries=1, all_are_nan=True
    group=('Imaging', 2008), entries=8, all_are_nan=True
    group=('Imaging', 2009), entries=3, all_are_nan=True
    group=('Imaging', 2010), entries=6, all_are_nan=True
    group=('Imaging', 2011), entries=3, all_are_nan=True
    group=('Imaging', 2012), entries=2, all_are_nan=True
    group=('Imaging', 2013), entries=7, all_are_nan=True
    group=('Microlensing', 2004), entries=1, all_are_nan=True
    group=('Microlensing', 2005), entries=2, all_are_nan=True
    group=('Microlensing', 2006), entries=1, all_are_nan=True
    group=('Microlensing', 2008), entries=4, all_are_nan=True
    group=('Microlensing', 2009), entries=2, all_are_nan=True
    group=('Microlensing', 2010), entries=2, all_are_nan=True
    group=('Microlensing', 2011), entries=1, all_are_nan=True
    group=('Microlensing', 2012), entries=6, all_are_nan=True
    group=('Microlensing', 2013), entries=4, all_are_nan=True
    group=('Orbital Brightness Modulation', 2011), entries=2, all_are_nan=True
    group=('Orbital Brightness Modulation', 2013), entries=1, all_are_nan=True
    group=('Pulsar Timing', 1992), entries=2, all_are_nan=True
    group=('Pulsar Timing', 1994), entries=1, all_are_nan=True
    group=('Pulsar Timing', 2003), entries=1, all_are_nan=True
    group=('Pulsar Timing', 2011), entries=1, all_are_nan=True
    group=('Pulsation Timing Variations', 2007), entries=1, all_are_nan=True
    group=('Transit', 2002), entries=1, all_are_nan=True
    group=('Transit', 2004), entries=5, all_are_nan=True
    group=('Transit', 2006), entries=5, all_are_nan=True
    group=('Transit', 2007), entries=16, all_are_nan=True
    group=('Transit', 2008), entries=17, all_are_nan=True
    group=('Transit', 2009), entries=18, all_are_nan=True
    group=('Transit', 2010), entries=48, all_are_nan=True
    group=('Transit', 2011), entries=80, all_are_nan=True
    group=('Transit', 2012), entries=92, all_are_nan=True
    group=('Transit', 2014), entries=40, all_are_nan=True
    group=('Transit Timing Variations', 2011), entries=1, all_are_nan=True
    group=('Transit Timing Variations', 2012), entries=1, all_are_nan=True
    group=('Transit Timing Variations', 2013), entries=1, all_are_nan=True
    group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
    
    
    col=distance
    group=('Eclipse Timing Variations', 2009), entries=1, all_are_nan=True
    group=('Eclipse Timing Variations', 2011), entries=3, all_are_nan=True
    group=('Eclipse Timing Variations', 2012), entries=1, all_are_nan=True
    group=('Microlensing', 2004), entries=1, all_are_nan=True
    group=('Microlensing', 2005), entries=2, all_are_nan=True
    group=('Microlensing', 2006), entries=1, all_are_nan=True
    group=('Microlensing', 2008), entries=4, all_are_nan=True
    group=('Microlensing', 2009), entries=2, all_are_nan=True
    group=('Microlensing', 2010), entries=2, all_are_nan=True
    group=('Microlensing', 2011), entries=1, all_are_nan=True
    group=('Orbital Brightness Modulation', 2013), entries=1, all_are_nan=True
    group=('Pulsar Timing', 1992), entries=2, all_are_nan=True
    group=('Pulsar Timing', 1994), entries=1, all_are_nan=True
    group=('Pulsar Timing', 2003), entries=1, all_are_nan=True
    group=('Pulsation Timing Variations', 2007), entries=1, all_are_nan=True
    group=('Transit', 2002), entries=1, all_are_nan=True
    group=('Transit Timing Variations', 2014), entries=1, all_are_nan=True
    method              0
    number              0
    orbital_period     28
    mass              405
    distance           26
    year                0
    dtype: int64
    

    Possible solution: Consider making your groups larger by removing year or method from your group.