Search code examples
pythonpandasdataframepivot-table

Pivot table on a column which contains two indices


I have an initial dataframe that looks like this:

   Row tags                  Values
0  4                         100
1  100101 - Hospital A.xls   30
2  100195 - Hospital B.xls   30
3  100105 - Hospital C.xls   40
4  5                         50
5  100101 - Hospital A.xls   25
6  100195 - Hospital B.xls   25

Whenever there is a number in Row tags (i.e. without the .xls ending), its value corresponds to the sum of the rows that follow beneath it. There can be an arbitrary number of rows that follow.

I would like to pivot on those numbers, to obtain the desired output:

   Code   Hospital                  Values
0  4      100101 - Hospital A.xls   30
1  4      100195 - Hospital B.xls   30
2  4      100105 - Hospital C.xls   40
3  5      100101 - Hospital A.xls   25
4  5      100195 - Hospital B.xls   25

What I have tried to do:

As there is a pattern in Row tags that can be captured with a Regex, I created an intermediary table with a boolean column which distinguishes the entries that are numbers with those that end with .xls.

   Row tags                  Values   Regex pattern
0  4                         100      False
1  100101 - Hospital A.xls   30       True
2  100195 - Hospital B.xls   30       True
3  100105 - Hospital C.xls   40       True
4  5                         50       False
5  100101 - Hospital A.xls   25       True
6  100195 - Hospital B.xls   25       True

But I'm stuck here.


Solution

  • import pandas as pd
    
    data = {
        'Row tags': ['4', '100101 - Hospital A.xls', '100195 - Hospital B.xls', '100105 - Hospital C.xls', '5', '100101 - Hospital A.xls', '100195 - Hospital B.xls'],
        'Values': [100, 30, 30, 40, 50, 25, 25]
    }
    df = pd.DataFrame(data)
    
    m = df['Row tags'].str.contains('.xls')
    df = df.assign( Code = df['Row tags'].where(m == False).ffill() )[m]
    
    print(df[['Code', 'Row tags', 'Values']])
    
      Code                 Row tags  Values
    1    4  100101 - Hospital A.xls      30
    2    4  100195 - Hospital B.xls      30
    3    4  100105 - Hospital C.xls      40
    5    5  100101 - Hospital A.xls      25
    6    5  100195 - Hospital B.xls      25