Search code examples
pythonpandasmergerowsfuse

Pandas - Reducing multiple rows with almost identical data into one


I am working with data that is almost identical, consisting of IDs and Type, however there could be multiple different types per ID. How can I merge/fuse each ID with all its respective types?

Current data form:

data = {'Name': {0: np.nan, 1: np.nan, 2: np.nan, 3: np.nan, 4: np.nan},
 'ID': {0: '1',
  1: '2',
  2: '3',
  3: '3',
  4: '4'},
 'Type': {0: 'TypeUp', 1: 'TypeExamp', 2: 'TypeUp', 3: 'TypeExamp', 4: 'TypeUp'},
 'Created': {0: date(2022, 6, 27),
  1: date(2022, 6, 28),
  2: date(2022, 6, 29),
  3: date(2022, 6, 29),
  4: date(2022, 6, 29)},
 'Updated': {0: date(2022, 6, 27),
  1: date(2022, 6, 28),
  2: date(2022, 6, 29),
  3: date(2022, 6, 29),
  4: date(2022, 6, 29)}}

Output:

Name ID    Type     Created     Updated
NaN  1     TypeUp  2022-06-27  2022-06-27
NaN  2  TypeExamp  2022-06-28  2022-06-28
NaN  3     TypeUp  2022-06-29  2022-06-29
NaN  3  TypeExamp  2022-06-29  2022-06-29
NaN  4     TypeUp  2022-06-29  2022-06-29

What I would like to achieve is essentially this:

Name ID    Type     Created     Updated.    Type2           Created2       Updated2
NaN  1     TypeUp  2022-06-27  2022-06-27   NaN             NaN            NaN
NaN  2  TypeExamp  2022-06-28  2022-06-28   NaN             NaN            NaN
NaN  3     TypeUp  2022-06-29  2022-06-29   TypeExamp       2022-06-29     2022-06-29
NaN  4     TypeUp  2022-06-29  2022-06-29   NaN             NaN            NaN

I have tried multiple solutions: groupby, merging on same dataframe, unstacking, none of them seem to work for my case. Note that there might be 7+ different Types and each with their own date columns, and a user might have them all, so it is very confusing and would likely take up a lot of columns if I understand correctly. I am having hard time wrapping my head around this when dealing with many columns that are almost identical but just different in one aspect.


Solution

  • You can use:

    out = (df.assign(Cat=df.groupby('ID').cumcount().add(1).astype(str))
             .pivot(['Name', 'ID'], 'Cat'))
    out.columns = out.columns.to_flat_index().str.join('-')
    out = out.reset_index().rename_axis(columns=None)
    

    Output:

    >>> out
       Name ID     Type-1     Type-2   Created-1   Created-2   Updated-1   Updated-2
    0   NaN  1     TypeUp        NaN  2022-06-27         NaN  2022-06-27         NaN
    1   NaN  2  TypeExamp        NaN  2022-06-28         NaN  2022-06-28         NaN
    2   NaN  3     TypeUp  TypeExamp  2022-06-29  2022-06-29  2022-06-29  2022-06-29
    3   NaN  4     TypeUp        NaN  2022-06-29         NaN  2022-06-29         NaN