Search code examples
pythonpandasdataframegroup-by

Pandas - Filling Missing values from list in Groups


I have a data frame with multiple category fields. I would like to fill the missing values in a column from other available list of values and this missing value should follow the group order.

This number list is not fixed. It will change based on 'Code B' and 'Code A'. We can't fix the number list.


| Code A | Code B | Hour | number | number list |
| ------ | ------ | ---- | ------ | ----------- |
| ABC    | ABC.1A | 2    | A      | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | NaN    | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | NaN    | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | B      | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | NaN    | [A,B,C,D,E] |
| ABC    | ABC.2B | 4    | NaN    | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | CF     | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | Nan    | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | NaN    | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | AB     | [AB,BD,CF,DH,EK] |

I would like to fill the missing value based on Hour group and want below data frame.


| Code A | Code B | Hour | number | number list |
| ------ | ------ | ---- | ------ | ----------  |
| ABC    | ABC.1A | 2    | A      | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | C      | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | D      | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | B      | [A,B,C,D,E] |
| ABC    | ABC.1A | 2    | E      | [A,B,C,D,E] |
| ABC    | ABC.2B | 4    | DH     | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | CF     | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | BD     | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | EK     | [AB,BD,CF,DH,EK] |
| ABC    | ABC.2B | 4    | AB     | [AB,BD,CF,DH,EK] |

Not able to figure out how to keep the running list and pop out a number if available in 'number' column otherwise put the number. This also needs to be in an hour (and other two Code A, Code B) groups.


Solution

  • Using a custom groupby.apply and a generator:

    def cust_fill(g, to_fill='number', ref='number list'):
        existing = set(g[to_fill].dropna())
        values = (v for v in df[ref].iloc[0] if v not in existing)
        return pd.Series([next(values, None) if pd.isna(value) else value
                          for value in g[to_fill]], index=g.index)
    
    df['number'] = df.groupby(['Code A', 'Code B'], group_keys=False).apply(cust_fill)
    

    Output:

      Code A  Code B  Hour number      number list
    0    ABC  ABC.1A     2      A  [A, B, C, D, E]
    1    ABC  ABC.1A     2      C  [A, B, C, D, E]
    2    ABC  ABC.1A     2      D  [A, B, C, D, E]
    3    ABC  ABC.1A     2      B  [A, B, C, D, E]
    4    ABC  ABC.1A     2      E  [A, B, C, D, E]
    5    ABC  ABC.2B     4      D  [A, B, C, D, E]
    6    ABC  ABC.2B     4      C  [A, B, C, D, E]
    7    ABC  ABC.2B     4      B  [A, B, C, D, E]
    8    ABC  ABC.2B     4      E  [A, B, C, D, E]
    9    ABC  ABC.2B     4      A  [A, B, C, D, E]