Pandas deleting cells when filling groups with NaN in groupby column


I am trying to fill in empty cells in a DataFrame by propagating known values to other rows within the same group. This works fine when every column being grouped on is fully populated, but if a grouping column contains empty cells, pandas deletes values. In the example below, the DataFrame has NaN in column "B" at indices 6 and 8. After using ffill and bfill on the groups, the numbers in rows 6 and 8 have been replaced with NaN. How do I avoid this unintended side effect?

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
                    "B": ['a', 'a', 'b', 'b', 'b', 'c', np.nan, 'd', 
                          np.nan, 'e', 'e', 'g', 'h', 'h'],
                    "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan, 
                          np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
                    "D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan, 
                          np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
                    "E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0, 
                          np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0]})

cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]

# Fill each column within its (A, B) group, then assign the result back by index
df[cols_to_fill] = (
    df.sort_values(cols_to_groupby)[cols_to_fill + cols_to_groupby]
    .groupby(cols_to_groupby)
    .transform(lambda x: x.ffill().bfill())
)

Sample Dataset:

    A    B    C    D    E
0   1    a  5.0  1.0  NaN
1   1    a  NaN  1.0  6.0
2   2    b  4.0  NaN  NaN
3   2    b  4.0  2.0  3.0
4   2    b  NaN  NaN  NaN
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  7.0
7   3    d  NaN  NaN  NaN
8   3  NaN  9.0  3.0  7.0
9   4    e  8.0  2.0  2.0
10  4    e  NaN  NaN  NaN
11  5    g  2.0  9.0  NaN
12  6    h  NaN  NaN  NaN
13  6    h  3.0  3.0  0.0

Expected Result:

    A    B    C    D    E
0   1    a  5.0  1.0  6.0
1   1    a  5.0  1.0  6.0
2   2    b  4.0  2.0  3.0
3   2    b  4.0  2.0  3.0
4   2    b  4.0  2.0  3.0
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  7.0
7   3    d  NaN  NaN  NaN
8   3  NaN  9.0  3.0  7.0
9   4    e  8.0  2.0  2.0
10  4    e  8.0  2.0  2.0
11  5    g  2.0  9.0  NaN
12  6    h  3.0  3.0  0.0
13  6    h  3.0  3.0  0.0

Actual Result:

    A    B    C    D    E
0   1    a  5.0  1.0  6.0
1   1    a  5.0  1.0  6.0
2   2    b  4.0  2.0  3.0
3   2    b  4.0  2.0  3.0
4   2    b  4.0  2.0  3.0
5   3    c  9.0  NaN  NaN
6   3  NaN  NaN  NaN  NaN
7   3    d  NaN  NaN  NaN
8   3  NaN  NaN  NaN  NaN
9   4    e  8.0  2.0  2.0
10  4    e  8.0  2.0  2.0
11  5    g  2.0  9.0  NaN
12  6    h  3.0  3.0  0.0
13  6    h  3.0  3.0  0.0

Solution

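  • By default, groupby ignores rows whose group-key columns contain NaN, so the groupby result holds no values for those rows. You therefore cannot rely on index alignment (df[cols_to_fill] = ...) to write the result back: the assignment overwrites the original values in those rows with NaN. Instead, explicitly update the original DataFrame with the groupby result. DataFrame.update only copies non-NaN values, so the rows that groupby skipped keep their data.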

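    # group_keys=False keeps the original row index (recent pandas versions otherwise prepend the group keys)
    cols = df.groupby(cols_to_groupby, group_keys=False)[cols_to_fill].apply(lambda x: x.ffill().bfill())
    # update() copies only non-NaN values, so rows skipped by groupby are left untouched
    df.update(cols)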
    

        A    B    C    D    E
    0   1    a  5.0  1.0  6.0
    1   1    a  5.0  1.0  6.0
    2   2    b  4.0  2.0  3.0
    3   2    b  4.0  2.0  3.0
    4   2    b  4.0  2.0  3.0
    5   3    c  9.0  NaN  NaN
    6   3  NaN  NaN  NaN  7.0
    7   3    d  NaN  NaN  NaN
    8   3  NaN  9.0  3.0  7.0
    9   4    e  8.0  2.0  2.0
    10  4    e  8.0  2.0  2.0
    11  5    g  2.0  9.0  NaN
    12  6    h  3.0  3.0  0.0
    13  6    h  3.0  3.0  0.0
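
As an alternative sketch (not part of the answer above), the same effect can probably be had by keeping transform and using its output only to fill holes, rather than assigning it over the columns. The names filled, result and covered are introduced here for illustration. Because ffill and bfill never change a value that is already present, filling the original NaN cells from the transformed frame should give the same result as the update approach.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6, 6],
    "B": ["a", "a", "b", "b", "b", "c", np.nan, "d",
          np.nan, "e", "e", "g", "h", "h"],
    "C": [5.0, np.nan, 4.0, 4.0, np.nan, 9.0, np.nan,
          np.nan, 9.0, 8.0, np.nan, 2.0, np.nan, 3.0],
    "D": [1.0, 1.0, np.nan, 2.0, np.nan, np.nan, np.nan,
          np.nan, 3.0, 2.0, np.nan, 9.0, np.nan, 3.0],
    "E": [np.nan, 6.0, np.nan, 3.0, np.nan, np.nan, 7.0,
          np.nan, 7.0, 2.0, np.nan, np.nan, np.nan, 0.0],
})

cols_to_groupby = ["A", "B"]
cols_to_fill = ["C", "D", "E"]

# With the default settings, rows whose key contains NaN belong to no group:
# only 12 of the 14 rows are covered by the groups.
covered = df.groupby(cols_to_groupby).size().sum()

# Per-group forward/backward fill; rows 6 and 8 get no usable values here.
filled = df.groupby(cols_to_groupby)[cols_to_fill].transform(
    lambda x: x.ffill().bfill()
)

# Only fill cells that were NaN in the original, aligned on index and column,
# so rows that groupby skipped keep whatever they already had.
result = df.copy()
result[cols_to_fill] = df[cols_to_fill].fillna(filled)

print(result)
```

Note that passing dropna=False to groupby (available in pandas 1.1 and later) is not equivalent here: it would put rows 6 and 8 into one shared group, so row 6 would be filled from row 8, which differs from the expected result in the question.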