
Reshape pandas DataFrame from wide to long by splitting


I am trying to reshape the following data from wide to long format:

import pandas as pd

df = pd.DataFrame(
    {
        "size_Ent": {
            pd.Timestamp("2021-01-01 00:00:00"): 600,
            pd.Timestamp("2021-01-02 00:00:00"): 930,
        },
        "size_Baci": {
            pd.Timestamp("2021-01-01 00:00:00"): 700,
            pd.Timestamp("2021-01-02 00:00:00"): 460,
        },
        "min_area_Ent": {
            pd.Timestamp("2021-01-01 00:00:00"): 1240,
            pd.Timestamp("2021-01-02 00:00:00"): 1503,
        },
        "min_area_Baci": {
            pd.Timestamp("2021-01-01 00:00:00"): 1285,
            pd.Timestamp("2021-01-02 00:00:00"): 953,
        },
    }
)

            size_Ent  size_Baci  min_area_Ent  min_area_Baci
2021-01-01       600        700          1240           1285
2021-01-02       930        460          1503            953

The problem is that the column names contain two different pieces of information separated by an underscore:

  1. The property/variable that was measured (e.g. size or min_area). I'd like these to remain as column names (without duplicates).
  2. A label for the item that was measured (e.g., Ent or Baci). I'd like these labels to become the values of a new column called 'bacterium'.

Additionally, I'd like the row indexes to remain as timestamps.

It should look like this:

           bacterium  min_area  size
2021-01-01      Baci      1285   700
2021-01-01       Ent      1240   600
2021-01-02      Baci       953   460
2021-01-02       Ent      1503   930

I tried transposing the data frame with df.T but this did not give the result I want.


Solution

    This can be solved in three steps:

    First, notice that your column names are actually encoding a 2x2 MultiIndex, so let's start by creating a MultiIndex from tuples. To do this, we need to first transform the existing column names into tuples. This is easy because we know they should be split at the last underscore.

    # Convert column names into MultiIndex, giving an informative name to the level with label data
    column_tuples = df.columns.str.rsplit("_", n=1)
    column_tuples = [tuple(c) for c in column_tuples]
    df.columns = pd.MultiIndex.from_tuples(column_tuples, names=[None, "bacterium"])
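
To see why the split has to happen at the last underscore, here is a small sketch that compares `split` with `rsplit` on the column names from the question. The variable names `columns` and `pairs` are only illustrative.

```python
import pandas as pd

# Column names copied from the question
columns = pd.Index(["size_Ent", "size_Baci", "min_area_Ent", "min_area_Baci"])

# split("_") breaks at every underscore, so "min_area" is torn into two pieces
print(columns.str.split("_").tolist())

# rsplit("_", n=1) breaks only at the last underscore, keeping "min_area" whole
pairs = [tuple(parts) for parts in columns.str.rsplit("_", n=1)]
print(pairs)
```

Every name then yields exactly two parts (property, label), which is the shape `pd.MultiIndex.from_tuples` needs for a two-level column index.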
    

    Next, use df.stack() to take the 'bacterium' level from the column MultiIndex and move it into a row MultiIndex. This is not quite the same as the transpose operation that you tried.

    df = df.stack('bacterium')
    

    Finally, use df.reset_index() with the level argument to take the bacterium level from the row MultiIndex and make it a proper column.

    df = df.reset_index('bacterium')
    

    Result:

               bacterium  min_area  size
    2021-01-01      Baci      1285   700
    2021-01-01       Ent      1240   600
    2021-01-02      Baci       953   460
    2021-01-02       Ent      1503   930
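
For convenience, the three steps can be combined into one function. This is a minimal sketch, not the only way to do it: the function name `wide_to_long_by_suffix` and its parameters are hypothetical, and it assumes every column name contains at least one separator. Depending on the pandas version, `stack` may or may not sort the stacked labels, so the sketch sorts rows and columns explicitly to get the order shown above.

```python
import pandas as pd


def wide_to_long_by_suffix(wide, label_name="bacterium", sep="_"):
    """Move the part of each column name after the last `sep` into a column."""
    out = wide.copy()  # leave the caller's frame untouched
    # Step 1: split each name at the last separator and build a column MultiIndex
    tuples = [tuple(name.rsplit(sep, 1)) for name in out.columns]
    out.columns = pd.MultiIndex.from_tuples(tuples, names=[None, label_name])
    # Step 2: move the label level from the columns into the row index
    out = out.stack(label_name)
    # Sort explicitly so the result does not depend on stack's ordering
    out = out.sort_index().sort_index(axis=1)
    # Step 3: turn the label level into an ordinary column
    return out.reset_index(label_name)


# Same data as in the question
wide = pd.DataFrame(
    {
        "size_Ent": [600, 930],
        "size_Baci": [700, 460],
        "min_area_Ent": [1240, 1503],
        "min_area_Baci": [1285, 953],
    },
    index=pd.to_datetime(["2021-01-01", "2021-01-02"]),
)
long_df = wide_to_long_by_suffix(wide)
print(long_df)
```

The timestamps stay in the row index, and each one appears once per label.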