Edit for clarity
I have a set of dataframes that I need to combine programatically. Each one represents a particular participant over a set of treatment sessions. Each frame comes from 1 sheet of an excel workbook (one workbook per client, 1 worksheet per psychometric instrument) which I have imported into pandas.
The data frame below is an example of Participant a, instrument a:
| | beaseline | time 1 | time 1 change | time 1 change from baseline | time 2 | time 2 change | time 2 change from baseline |
|--------- |----------- |-------- |--------------- |----------------------------- |-------- |--------------- |----------------------------- |
| item 1 | 3 | 4 | Nan | 1 | 4 | 0 | 1 |
| item 2 | 3 | 2 | Nan | 0 | 3 | 1 | 0 |
| item 3 | 1 | 1 | Nan | 0 | 2 | 1 | 1 |
| item 4 | 1 | 1 | Nan | 0 | 1 | 0 | 0 |
| item 5 | 1 | 2 | Nan | 1 | 1 | -1 | 0 |
I need to combine this with observations from other participants in a tabular dataframe such that each row is a participant and each column is a reponse to an item at a particular time point.*
So I need to reshaped this data into something like this:
| | Baseline item 1 | Baseline item 2 | Baseline item 3 | Baseline item 4 | Baseline item 5 | time 1 item 1 | time 1 item 2 |
|--- |----------------- |----------------- |----------------- |----------------- |----------------- |--------------- |--------------- |
| 0 | 3 | 3 | 1 | 1 | 1 | 4 | 2 |
Once I have done this reshaping for each dataframe I can then append them together and add the client code as the first column, thereby distinguishing each participant.
| | Client code | Baseline item 1 | Baseline item 2 | Baseline item 3 | Baseline item 4 | Baseline item 5 | time 1 item 1 | time 1 item 2 |
|--- |------------- |----------------- |----------------- |----------------- |----------------- |----------------- |--------------- |--------------- |
| 0 | xxi212121 | 3 | 3 | 1 | 1 | 1 | 4 | 2 | | 4 | 2 |
I have already used a lot of repeated code and essentially manually pulled each slice as a series, renamed it and then created a new dataframe:
baseline = pd.Dataframe(raw_data.iloc[2:,1])
time_1 = pd.Dataframe(raw_data.iloc[2:,2])
but obviously this is a really silly solution.
I need to able to repeat this across a lot of similar dataframes so I need a more useful solution, Any help would be really appreciated.
Create and flatten the MultiIndex
columns, then with np.hstack
horizontally stack all the rows in the dataframe and create a new dataframe from stacked
rows with the flattened columns:
i = pd.MultiIndex.from_product([df.columns[1:], df.iloc[:, 0]]).map('-'.join)
s = pd.DataFrame([np.hstack(df.to_numpy()[:, 1:].T)], columns=i)
Example (column names removed for simplicity):
print(df)
0 1 2 3 4 5 6 7
0 item 1 3 4 Nan 1 4 0 1
1 item 2 3 2 Nan 0 3 1 0
2 item 3 1 1 Nan 0 2 1 1
3 item 4 1 1 Nan 0 1 0 0
4 item 5 1 2 Nan 1 1 -1 0
print(s)
1-item 1 1-item 2 1-item 3 1-item 4 1-item 5 2-item 1 2-item 2 2-item 3 2-item 4 2-item 5 3-item 1 3-item 2 3-item 3 3-item 4 3-item 5 4-item 1 4-item 2 4-item 3 4-item 4 4-item 5 5-item 1 5-item 2 5-item 3 5-item 4 5-item 5 6-item 1 6-item 2 6-item 3 6-item 4 6-item 5 7-item 1 7-item 2 7-item 3 7-item 4 7-item 5
0 3 3 1 1 1 4 2 1 1 2 Nan Nan Nan Nan Nan 1 0 0 0 1 4 3 2 1 1 0 1 1 0 -1 1 0 1 0 0