Search code examples
pythonpandaspivotconcatenationpandas-groupby

Pandas: Pivoting dataframe to one row per User (group)


Firstly, this is not a duplicate question, I have looked at similar questions, but the solutions are not correct for this example.

Here is a simplified example of my pandas dataframe:

  User Category Frequency    Visit 1 Date_Time    Visit 2 Date_Time  \

0    A    alpha         1  2020-03-01 00:00:00                    -   

1    A     beta         2  2020-04-01 00:00:00  2020-06-01 00:00:00   

2    B    alpha         3  2020-06-01 00:00:00  2020-07-01 00:00:00   

3    B     beta         1  2020-07-01 00:00:00                    -   

4    B    gamma         3  2018-10-01 00:00:00  2019-06-01 00:00:00   

5    C    gamma         1  2019-08-01 00:00:00                    -   

6    C   indigo         1  2018-09-01 00:00:00                    -   



     Visit 3 Date_Time Visit 4 Date_Time  

0                    -          -  

1                    -          - 

2  2020-08-01 00:00:00          -  

3                    -          -             

4  2020-10-01 00:00:00          - 

5                    -          -        

6                    -          - 

I would like to transform this dataframe so that it has exactly one row per User, like below. One important thing to note is that the Visit Date_Time columns are not fixed, the code should also work if Visit 5 Date_Time existed, for example.

  User  alpha_Frequency alpha_Visit 1 Date_Time alpha_Visit 2 Date_Time  \

0    A                1     2020-03-01 00:00:00                       -   

1    B                3     2020-06-01 00:00:00     2020-07-01 00:00:00   

2    C                0                       -                       -   



  alpha_Visit 3 Date_Time alpha_Visit 4 Date_Time  beta_Frequency  \

0                       -                       -               2   

1     2020-08-01 00:00:00                       -               1   

2                       -                       -               0   



  beta_Visit 1 Date_Time beta_Visit 2 Date_Time beta_Visit 3 Date_Time  \

0    2020-04-01 00:00:00    2020-06-01 00:00:00                      -   

1    2020-07-01 00:00:00                      -                      -   

2                      -                      -                      -   



            ...            gamma_Frequency  gamma_Visit 1 Date_Time  \

0           ...                          0                        -   

1           ...                          3      2018-10-01 00:00:00   

2           ...                          1      2019-08-01 00:00:00   



  gamma_Visit 2 Date_Time gamma_Visit 3 Date_Time gamma_Visit 4 Date_Time  \

0                       -                       -                       -   

1     2019-06-01 00:00:00     2020-10-01 00:00:00                       -   

2                       -                       -                       -   



  indigo_Frequency  indigo_Visit 1 Date_Time indigo_Visit 2 Date_Time  \

0                0                         -                        -   

1                0                         -                        -   

2                1       2018-09-01 00:00:00                        -   



  indigo_Visit 3 Date_Time indigo_Visit 4 Date_Time  

0                        -                        -  

1                        -                        -  

2                        -                        - 

Essentially, it is concatenating the Category column with the Frequency and Visit Date_Time columns, then pivoting.

The code below I have tried, but it does not quite work.

df.pivot_table(index='User', 
                     columns=df.groupby('Category').cumcount(),
                     aggfunc='first').sort_index(axis=1, level=1)

Solution

  • This code should solve your problem

    import pandas as pd
    
    # Example df
    data = {'user': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
            'category':['a','b', 'a', 'b', 'g', 'g', 'i'],
            'freq': [1, 2, 3, 1, 3, 1, 2],
            'visit_1_date': ['2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01']
            }
    df = pd.DataFrame(data)
    
    result = df.pivot(index='user', columns='category')
    print(result.head())
    
    # OUTPUT
             freq                visit_1_date                                    
    category    a    b    g    i            a           b           g           i
    user                                                                         
    A         1.0  2.0  NaN  NaN   2019-10-01  2019-10-01         NaN         NaN
    B         3.0  1.0  3.0  NaN   2019-10-01  2019-10-01  2019-10-01         NaN
    C         NaN  NaN  1.0  2.0          NaN         NaN  2019-10-01  2019-10-01
    

    If you try the same code but with a dataset with more columns like visit_2_date, ... you will get the expanded results.

    data = {'user': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
            'category':['a','b', 'a', 'b', 'g', 'g', 'i'],
            'freq': [1, 2, 3, 1, 3, 1, 2],
            'visit_1_date': ['2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01'],
            'visit_2_date': ['2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01']
            }
    
    df = pd.DataFrame(data)
    
    result = df.pivot(index='user', columns='category')
    print(result.head())
    
    # OUTPUT
             freq                 ... visit_2_date                                    
    category    a    b    g    i  ...            a           b           g           i
    user                          ...                                                 
    A         1.0  2.0  NaN  NaN  ...   2019-10-01  2019-10-01         NaN         NaN
    B         3.0  1.0  3.0  NaN  ...   2019-10-01  2019-10-01  2019-10-01         NaN
    C         NaN  NaN  1.0  2.0  ...          NaN         NaN  2019-10-01  2019-10-01
    

    Basically, when we pivot using user as index-parameter we force to return one-row per user. If we setting category as columns-parameter without specifying the values-parameter, the function will use remaining columns as values.

    Check the documentation for further information