Search code examples
pythonpandaspandas-groupbymeanfillna

pandas groupby fillna code does not work and gives error


I have the dataframe like this:

data = {'name':  ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry'],
        'job': ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer'],
        'age': [27, 32, 78,27, 32, 78],
        'weight': [160, 209, 130,164, 206, 132],
        'date': ['6-12-2022', '6-12-2022', '6-12-2022','6-13-2022', '6-13-2022', '6-13-2022']
        }

df = pd.DataFrame(data) df

I add data for another date as null values:

    |name   |job        |age|weight |date
|---|-------|-----------|---|-------|--------
|0  |Alex   |teacher    |27 |160    |6-12-2022
|1  |Ben    |doctor     |32 |209    |6-12-2022
|2  |Marry  |engineer   |78 |130    |6-12-2022
|3  |Alex   |teacher    |27 |164    |6-13-2022
|4  |Ben    |doctor     |32 |206    |6-13-2022
|5  |Marry  |engineer   |78 |132    |6-13-2022
|6  |Alex   |teacher    |NaN|NaN    |6-14-2022
|7  |Ben    |doctor     |NaN|NaN    |6-14-2022
|8  |Marry  |engineer   |NaN|NaN    |6-14-2022

Now, I want to fillna values in weight column after groupby name and job, and get average

I tried these lines of codes:

df.loc[df.weight.isnull(), 'weight'] = df.groupby(["name", "job"]).weight.transform('mean')

or

df['weight'] = df['weight'].fillna(df.groupby(["name", "job"])['weight'].mean())

or

df['weight'] = df.groupby(["name", "job"], sort=False)['weight'].apply(lambda x: x.fillna(x.mean()))

For the first one, I get this error:

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

name and job are string, and weight is flout64.

and for the rest it runs but doesnt fill na values.


Solution

  • For your purpose using Andrej Kesely answer might be enough, but have in mind that using apply with pandas is not good performance-wise.

    A better option to maintain performance is to use .transform('mean') like

      df["age"].fillna(df.groupby(["name", "job"])["age"].transform("mean"), inplace=True) 
      df["weight"].fillna(df.groupby(["name", "job"])["weight"].transform("mean"), inplace=True) 
    

    This is just a comparison between the groupby and apply strategy and the transform one. Where slow referes to the groupby-apply and fast to the transform.

    enter image description here

    To reproduce this comparison you can run this

    import numpy as np
    import pandas as pd
    import time
    from tqdm import tqdm
    
    def build_dataset(N):
      names = ['Alex', 'Ben', 'Marry','Alex', 'Ben', 'Marry']
      jobs = ['teacher', 'doctor', 'engineer','teacher', 'doctor', 'engineer']
    
      data = {
          'name':  np.random.choice(names, size=N),
          'job': np.random.choice(jobs, size=N),
          'age': np.random.uniform(low=10, high=90, size=N),
          'weight': np.random.uniform(low=60, high=150, size=N)
      }
    
      df = pd.DataFrame(data)
      df.loc[df.sample(frac=0.1).index, "age"] = np.nan
      df.loc[df.sample(frac=0.1).index, "weight"] = np.nan
      return df
    
    def slow_way(df):
      df = df.copy()
      def fn(x):
          x["age"].fillna(x["age"].mean(), inplace=True)
          x["weight"].fillna(x["weight"].mean(), inplace=True)
          return x
    
    
      return df.groupby(["name", "job"]).apply(fn)
    
    def fast_way(df):
      df = df.copy()
      df["age"].fillna(df.groupby(["name", "job"])["age"].transform("mean"), inplace=True) 
      df["weight"].fillna(df.groupby(["name", "job"])["weight"].transform("mean"), inplace=True) 
    
      return df
    
    Ns = np.arange(10, 100000, step=1000, dtype=np.int32)
    slow = []
    fast = []
    size = []
    for N in tqdm(Ns):
      for i in range(10):
        df = build_dataset(N)
        start = time.time()
        _ = slow_way(df)
        end = time.time()
        slow.append(end-start)
        start = time.time()
        _ = fast_way(df)
        end = time.time()
        fast.append(end-start)
        size.append(N)
    
    df = pd.DataFrame({"N": size, "slow": slow, "fast": fast})
    df_group = df.groupby("N").mean()
    df_group.plot(figsize=(30,10))