Search code examples
pythonpandasdataframegroup-by

Creating a new DataFrame column from application of function to multiple columns in groupby


I have a DataFrame of population counts by combination of categorical demographic features and date, with some missing values (consistent across all combos) per date constituting gaps in the data.

I am attempting to:

  1. group by all demographic features
  2. apply a function to each time series of population counts per demographic group (this requires manipulating both the date column and the population column)
  3. create a new column in the original (ungrouped) DataFrame based on this function

The function in (2) acts on the existing population column with missing values, redistributing post-gap counts backwards over the gap. I believe the function works as intended, but I am struggling to stitch it into the context of the group-by and turn that into a new column in the DataFrame.

Here is sample data:

          age          race     gender         date       population
0       15-24          AAPI       Male   2020-01-01              1.0   
1       15-24          AAPI       Male   2020-01-02              2.0
2       15-24          AAPI       Male   2020-01-03              2.0
... 
7       15-24         Black     Female   2020-01-01              0.0   
8       15-24         Black     Female   2020-01-02              NaN
9       15-24         Black     Female   2020-01-03              3.0     

For the above trivial example, the desired output would be:

          age          race     gender         date       population   interpolated
0       15-24          AAPI       Male   2020-01-01              1.0            1.0   
1       15-24          AAPI       Male   2020-01-02              2.0            2.0
2       15-24          AAPI       Male   2020-01-03              2.0            2.0
... 
7       15-24         Black     Female   2020-01-01              0.0            0.0   
8       15-24         Black     Female   2020-01-02              NaN            1.5
9       15-24         Black     Female   2020-01-03              3.0            1.5     

I have created the following function, which takes an input list of date gaps:

gaps = [
    {
        "gap": [2020-01-02],
        "day_after": 2020-01-03,
    }
]

def bfill_pop(gaps, group):
    for el in gaps:
        fill_val = group.loc[group["date"] == el["day_after"], "population"] / (
            len(el["gap"]) + 1
        )
        group.loc[group["date"].isin(el["gap"]), "population"] = fill_val
        group.loc[group["date"] == el["day_after"], "population"] = fill_val
    return group.rename(columns={"population": "interpolated"})["interpolated"]

When I try to apply this to the DataFrame with apply() or transform() functions though, I get errors, e.g.:

df["interpolated"] = df.groupby(["age", "race", "gender"]).apply(
    lambda g: bfill_pop(gaps, g)
)
> ValueError: cannot handle a non-unique multi-index!

Is there a way to do this with apply or transform functions?


Solution

  • You were nearly there. It was simply a miss-handling of indices in the function:

    import pandas as pd
    
    def bfill_pop(gaps, group):
        for el in gaps:
            day_after_population = group.loc[group['date'] == pd.to_datetime(el['day_after']), 'population']
            if not day_after_population.empty:
                fill_val = day_after_population.iloc[0] / (len(el['gap']) + 1)
                group.loc[group['date'].isin([pd.to_datetime(date) for date in el['gap']]), 'population'] = fill_val
                group.loc[group['date'] == pd.to_datetime(el['day_after']), 'population'] = fill_val
        return group
    
    data = {
        'age': ['15-24'] * 6,
        'race': ['AAPI', 'AAPI', 'Black', 'Black', 'Black', 'Black'],
        'gender': ['Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
        'date': pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03']),
        'population': [1.0, 2.0, 0.0, np.nan, np.nan, 3.0]
    }
    df = pd.DataFrame(data)
    
    gaps = [
        {
            "gap": ['2020-01-02'],
            "day_after": '2020-01-03',
        }
    ]
    
    df['interpolated'] = df.groupby(['age', 'race', 'gender']).apply(
        lambda g: bfill_pop(gaps, g)
    ).reset_index(drop=True)['population']  
    
    print(df)
    
    

    which gives you

         age   race  gender       date  population  interpolated
    0  15-24   AAPI    Male 2020-01-01         1.0           1.0
    1  15-24   AAPI    Male 2020-01-02         2.0           2.0
    2  15-24  Black  Female 2020-01-01         0.0           0.0
    3  15-24  Black  Female 2020-01-02         NaN           1.5
    4  15-24  Black  Female 2020-01-02         NaN           1.5
    5  15-24  Black  Female 2020-01-03         3.0           1.5