Firstly, this is not a duplicate question, I have looked at similar questions, but the solutions are not correct for this example.
Here is a simplified example of my pandas dataframe:
User Category Frequency Visit 1 Date_Time Visit 2 Date_Time \
0 A alpha 1 2020-03-01 00:00:00 -
1 A beta 2 2020-04-01 00:00:00 2020-06-01 00:00:00
2 B alpha 3 2020-06-01 00:00:00 2020-07-01 00:00:00
3 B beta 1 2020-07-01 00:00:00 -
4 B gamma 3 2018-10-01 00:00:00 2019-06-01 00:00:00
5 C gamma 1 2019-08-01 00:00:00 -
6 C indigo 1 2018-09-01 00:00:00 -
Visit 3 Date_Time Visit 4 Date_Time
0 - -
1 - -
2 2020-08-01 00:00:00 -
3 - -
4 2020-10-01 00:00:00 -
5 - -
6 - -
I would like to transform this dataframe so that it has exactly one row per User, like below. One important thing to note is that the Visit Date_Time columns are not fixed, the code should also work if Visit 5 Date_Time existed, for example.
User alpha_Frequency alpha_Visit 1 Date_Time alpha_Visit 2 Date_Time \
0 A 1 2020-03-01 00:00:00 -
1 B 3 2020-06-01 00:00:00 2020-07-01 00:00:00
2 C 0 - -
alpha_Visit 3 Date_Time alpha_Visit 4 Date_Time beta_Frequency \
0 - - 2
1 2020-08-01 00:00:00 - 1
2 - - 0
beta_Visit 1 Date_Time beta_Visit 2 Date_Time beta_Visit 3 Date_Time \
0 2020-04-01 00:00:00 2020-06-01 00:00:00 -
1 2020-07-01 00:00:00 - -
2 - - -
... gamma_Frequency gamma_Visit 1 Date_Time \
0 ... 0 -
1 ... 3 2018-10-01 00:00:00
2 ... 1 2019-08-01 00:00:00
gamma_Visit 2 Date_Time gamma_Visit 3 Date_Time gamma_Visit 4 Date_Time \
0 - - -
1 2019-06-01 00:00:00 2020-10-01 00:00:00 -
2 - - -
indigo_Frequency indigo_Visit 1 Date_Time indigo_Visit 2 Date_Time \
0 0 - -
1 0 - -
2 1 2018-09-01 00:00:00 -
indigo_Visit 3 Date_Time indigo_Visit 4 Date_Time
0 - -
1 - -
2 - -
Essentially, it is concatenating the Category column with the Frequency and Visit Date_Time columns, then pivoting.
The code below I have tried, but it does not quite work.
df.pivot_table(index='User',
columns=df.groupby('Category').cumcount(),
aggfunc='first').sort_index(axis=1, level=1)
This code should solve your problem
import pandas as pd
# Example df
data = {'user': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'category':['a','b', 'a', 'b', 'g', 'g', 'i'],
'freq': [1, 2, 3, 1, 3, 1, 2],
'visit_1_date': ['2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01']
}
df = pd.DataFrame(data)
result = df.pivot(index='user', columns='category')
print(result.head())
# OUTPUT
freq visit_1_date
category a b g i a b g i
user
A 1.0 2.0 NaN NaN 2019-10-01 2019-10-01 NaN NaN
B 3.0 1.0 3.0 NaN 2019-10-01 2019-10-01 2019-10-01 NaN
C NaN NaN 1.0 2.0 NaN NaN 2019-10-01 2019-10-01
If you try the same code but with a dataset with more columns like visit_2_date, ... you will get the expanded results.
data = {'user': ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
'category':['a','b', 'a', 'b', 'g', 'g', 'i'],
'freq': [1, 2, 3, 1, 3, 1, 2],
'visit_1_date': ['2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01'],
'visit_2_date': ['2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01', '2019-10-01']
}
df = pd.DataFrame(data)
result = df.pivot(index='user', columns='category')
print(result.head())
# OUTPUT
freq ... visit_2_date
category a b g i ... a b g i
user ...
A 1.0 2.0 NaN NaN ... 2019-10-01 2019-10-01 NaN NaN
B 3.0 1.0 3.0 NaN ... 2019-10-01 2019-10-01 2019-10-01 NaN
C NaN NaN 1.0 2.0 ... NaN NaN 2019-10-01 2019-10-01
Basically, when we pivot using user as index-parameter we force to return one-row per user. If we setting category as columns-parameter without specifying the values-parameter, the function will use remaining columns as values.
Check the documentation for further information