Search code examples
pythonpandasdataframedata-munging

Programaticallly re-organising pandas slices


Edit for clarity

I have a set of dataframes that I need to combine programatically. Each one represents a particular participant over a set of treatment sessions. Each frame comes from 1 sheet of an excel workbook (one workbook per client, 1 worksheet per psychometric instrument) which I have imported into pandas.

The data frame below is an example of Participant a, instrument a:

|           | beaseline     | time 1    | time 1 change     | time 1 change from baseline   | time 2    | time 2 change     | time 2 change from baseline   |
|---------  |-----------    |--------   |---------------    |-----------------------------  |--------   |---------------    |-----------------------------  |
| item 1    | 3             | 4         | Nan               | 1                             | 4         | 0                 | 1                             |
| item 2    | 3             | 2         | Nan               | 0                             | 3         | 1                 | 0                             |
| item 3    | 1             | 1         | Nan               | 0                             | 2         | 1                 | 1                             |
| item 4    | 1             | 1         | Nan               | 0                             | 1         | 0                 | 0                             |
| item 5    | 1             | 2         | Nan               | 1                             | 1         | -1                | 0                             |

I need to combine this with observations from other participants in a tabular dataframe such that each row is a participant and each column is a reponse to an item at a particular time point.*

So I need to reshaped this data into something like this:

|       | Baseline item 1   | Baseline item 2   | Baseline item 3   | Baseline item 4   | Baseline item 5   | time 1 item 1     | time 1 item 2     |
|---    |-----------------  |-----------------  |-----------------  |-----------------  |-----------------  |---------------    |---------------    |
| 0     | 3                 | 3                 | 1                 | 1                 | 1                 | 4                 | 2                 |

Once I have done this reshaping for each dataframe I can then append them together and add the client code as the first column, thereby distinguishing each participant.

|       | Client code   | Baseline item 1   | Baseline item 2   | Baseline item 3   | Baseline item 4   | Baseline item 5   | time 1 item 1     | time 1 item 2     |
|---    |-------------  |-----------------  |-----------------  |-----------------  |-----------------  |-----------------  |---------------    |---------------    |
| 0     | xxi212121     | 3                 | 3                 | 1                 | 1                 | 1                 | 4                 | 2                 |               | 4                 | 2                 |

I have already used a lot of repeated code and essentially manually pulled each slice as a series, renamed it and then created a new dataframe:

baseline = pd.Dataframe(raw_data.iloc[2:,1])
time_1 = pd.Dataframe(raw_data.iloc[2:,2])

but obviously this is a really silly solution.

I need to able to repeat this across a lot of similar dataframes so I need a more useful solution, Any help would be really appreciated.


Solution

  • Create and flatten the MultiIndex columns, then with np.hstack horizontally stack all the rows in the dataframe and create a new dataframe from stacked rows with the flattened columns:

    i = pd.MultiIndex.from_product([df.columns[1:], df.iloc[:, 0]]).map('-'.join)
    s = pd.DataFrame([np.hstack(df.to_numpy()[:, 1:].T)], columns=i)
    

    Example (column names removed for simplicity):

    print(df)
    
            0  1  2    3  4  5  6  7
    0  item 1  3  4  Nan  1  4  0  1
    1  item 2  3  2  Nan  0  3  1  0
    2  item 3  1  1  Nan  0  2  1  1
    3  item 4  1  1  Nan  0  1  0  0
    4  item 5  1  2  Nan  1  1 -1  0
    

    print(s)
    
       1-item 1  1-item 2  1-item 3  1-item 4  1-item 5  2-item 1  2-item 2  2-item 3  2-item 4  2-item 5 3-item 1 3-item 2 3-item 3 3-item 4 3-item 5  4-item 1  4-item 2  4-item 3  4-item 4  4-item 5  5-item 1  5-item 2  5-item 3  5-item 4  5-item 5  6-item 1  6-item 2  6-item 3  6-item 4  6-item 5  7-item 1  7-item 2  7-item 3  7-item 4  7-item 5
    0         3         3         1         1         1         4         2         1         1         2      Nan      Nan      Nan      Nan      Nan         1         0         0         0         1         4         3         2         1         1         0         1         1         0        -1         1         0         1         0         0