I will always be working with fewer than 1,000 rows of data, so I am not too concerned with performance. The transformation will never exceed OP#5 (i.e. at most five CAT/TIME pairs per ID). Please advise on the best way to do this in pandas or pure Python.
I attempted to do this with `DataFrame.pivot_table`. The issue is that the index repeats itself, so it simply creates a new row for each CAT, rather than putting the 1st CAT for an index into OP#1, the 2nd CAT for that duplicate index into OP#2, and so on.
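For what it's worth, one common way to disambiguate repeated index values before pivoting is `groupby(...).cumcount()`, which assigns a running counter within each group. A minimal sketch on a hypothetical toy frame (column names mirror the tables below):

```python
import pandas as pd

# Hypothetical toy frame mirroring the shape of the data below.
df = pd.DataFrame({
    'ID':  ['id1', 'id1', 'id2'],
    'CAT': ['CAT A', 'CAT B', 'CAT A'],
})

# cumcount() numbers each row within its ID group (0-based), add(1) makes it 1-based
df['cc'] = df.groupby('ID').cumcount().add(1)
print(df['cc'].tolist())  # -> [1, 2, 1]
```

The counter `cc` can then serve as the pivot's column key, so duplicate IDs land in OP#1, OP#2, … instead of extra rows.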
Data before transform
ID | MISC | CAT | TIME |
---|---|---|---|
id1 | 123 | CAT A | VALUE |
id1 | 123 | CAT B | VALUE |
id1 | 123 | CAT C | VALUE |
id1 | 123 | CAT D | VALUE |
id1 | 123 | CAT E | VALUE |
id2 | 145 | CAT A | VALUE |
id2 | 145 | CAT B | VALUE |
id3 | 101 | CAT D | VALUE |
id3 | 101 | CAT E | VALUE |
id4 | 500 | CAT D | VALUE |
id5 | 600 | CAT E | VALUE |
Data after transform
ID | MISC | OP#1 | TIME | OP#2 | TIME | OP#3 | TIME | OP#4 | TIME | OP#5 | TIME |
---|---|---|---|---|---|---|---|---|---|---|---|
id1 | 123 | CAT A | VALUE | CAT B | VALUE | CAT C | VALUE | CAT D | VALUE | CAT E | VALUE |
id2 | 145 | CAT A | VALUE | CAT B | VALUE | | | | | | |
id3 | 101 | CAT D | VALUE | CAT E | VALUE | | | | | | |
id4 | 500 | CAT D | VALUE | | | | | | | | |
id5 | 600 | CAT E | VALUE | | | | | | | | |
Code
import pandas as pd

out = (
    df.assign(cc=df.groupby('ID').cumcount().add(1))  # number repeats within each ID
      .pivot(index=['ID', 'MISC'], columns='cc')      # wide frame with (CAT/TIME, cc) columns
      .sort_index(axis=1, level=1)                    # interleave each CAT with its TIME
      .pipe(lambda d: d.set_axis(
          d.columns.map(lambda c: f'OP#{c[1]}' if c[0] == 'CAT' else 'TIME'),
          axis=1))                                    # flatten the column MultiIndex
      .reset_index()
)
out
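Since the question also allows pure Python, here is a hedged sketch with no pandas at all, assuming the input arrives as a list of `(ID, MISC, CAT, TIME)` tuples (the variable names are hypothetical); it groups the pairs per ID and flattens each group into one wide row:

```python
from collections import defaultdict

# Hypothetical input: list of (ID, MISC, CAT, TIME) tuples.
rows = [
    ('id1', 123, 'CAT A', 'v1'),
    ('id1', 123, 'CAT B', 'v2'),
    ('id2', 145, 'CAT A', 'v3'),
]

# Group the (CAT, TIME) pairs by (ID, MISC), preserving input order.
grouped = defaultdict(list)
for id_, misc, cat, time in rows:
    grouped[(id_, misc)].append((cat, time))

# Flatten each group into one wide row: ID, MISC, OP#1's CAT, TIME, OP#2's CAT, TIME, ...
wide = []
for (id_, misc), pairs in grouped.items():
    row = [id_, misc]
    for cat, time in pairs:
        row.extend([cat, time])
    wide.append(row)

print(wide)
# [['id1', 123, 'CAT A', 'v1', 'CAT B', 'v2'], ['id2', 145, 'CAT A', 'v3']]
```

For under 1,000 rows this is plenty fast, and the ragged row lengths naturally match the blank trailing cells in the desired output table.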