python pandas dataframe group-by pivot-table

Create new rows based on missing grouped by values


Given the dataframe below, where rows are grouped by first name and last name, how can I find the groups that do not have a row for every type in the types list, and create the missing rows?

In the example below, two new rows would be created for Bob Jack, since they are missing from the original dataframe: one with type 'DA' and another with type 'FA'. The Value column can be set to 0 for the new rows.

import pandas as pd

data = {
    'First Name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob'],
    'Last Name': ['Johnson', 'Johnson', 'Johnson', 'Johnson', 'Jack', 'Jack'],
    'Type': ['CA', 'DA', 'FA', 'GCA', 'CA', 'GCA'],
    'Value': [25, 30, 35, 40, 50, 37]
}

types = ['CA', 'DA', 'FA', 'GCA']
df = pd.DataFrame(data)

Solution

  • One way to do this is to build a dataframe containing every combination of name and type, then left join the original dataframe onto it. The result has a row for every combination, with NaN in Value wherever the original data had no entry; those NaN values can then be filled with 0. Note that because the Value column temporarily holds NaN, it is converted to float. You can convert it back to int, if desired, with astype({'Value': int}) in the chain:

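    out = (df[['First Name', 'Last Name']]
        # one row per distinct (first name, last name) pair
        .drop_duplicates()
        # pair every name with every type (how='cross' needs pandas 1.2+)
        .merge(pd.Series(types, name='Type'), how='cross')
        # bring in the existing values; missing combinations get NaN
        .merge(df, on=['First Name', 'Last Name', 'Type'], how='left')
        .fillna(0)
        # use this astype if you need `Value` to be an int
        .astype({'Value': int})
    )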
    

    Output (with the astype to convert Value back to int):

      First Name Last Name Type  Value
    0      Alice   Johnson   CA     25
    1      Alice   Johnson   DA     30
    2      Alice   Johnson   FA     35
    3      Alice   Johnson  GCA     40
    4        Bob      Jack   CA     50
    5        Bob      Jack   DA      0
    6        Bob      Jack   FA      0
    7        Bob      Jack  GCA     37
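
As an alternative sketch (not the approach in the answer above), the same rows can be produced by reindexing on a MultiIndex of the wanted combinations. Passing fill_value=0 to reindex fills the gaps directly, so Value should never pass through NaN and the astype step is not needed. The names pairs and full_index are illustrative, and the sketch assumes each (First Name, Last Name, Type) combination appears at most once in df, since reindex does not accept a duplicated index:

```python
import pandas as pd

data = {
    'First Name': ['Alice', 'Alice', 'Alice', 'Alice', 'Bob', 'Bob'],
    'Last Name': ['Johnson', 'Johnson', 'Johnson', 'Johnson', 'Jack', 'Jack'],
    'Type': ['CA', 'DA', 'FA', 'GCA', 'CA', 'GCA'],
    'Value': [25, 30, 35, 40, 50, 37]
}
types = ['CA', 'DA', 'FA', 'GCA']
df = pd.DataFrame(data)

keys = ['First Name', 'Last Name', 'Type']

# Distinct (first, last) pairs that actually occur in the data
# (illustrative helper name, not from the original answer).
names = df[['First Name', 'Last Name']].drop_duplicates()
pairs = zip(names['First Name'], names['Last Name'])

# Every existing name pair combined with every type.
full_index = pd.MultiIndex.from_tuples(
    [(first, last, t) for first, last in pairs for t in types],
    names=keys,
)

# Assumes the key columns are unique per row; reindex raises otherwise.
out = (df.set_index(keys)
         .reindex(full_index, fill_value=0)
         .reset_index())

print(out)
```

Building the index from the observed name pairs, rather than from a full product of first names, last names, and types, avoids inventing people such as Alice Jack who never appear in the data.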