I have an initial dataframe that looks like this:
Row tags Values
0 4 100
1 100101 - Hospital A.xls 30
2 100195 - Hospital B.xls 30
3 100105 - Hospital C.xls 40
4 5 50
5 100101 - Hospital A.xls 25
6 100195 - Hospital B.xls 25
Whenever there is a number in Row tags
(i.e. without the .xls ending), its value corresponds to the sum of the rows that follow beneath it. There can be an arbitrary number of rows that follow.
I would like to pivot on those numbers, to obtain the desired output:
Code Hospital Values
0 4 100101 - Hospital A.xls 30
1 4 100195 - Hospital B.xls 30
2 4 100105 - Hospital C.xls 40
3 5 100101 - Hospital A.xls 25
4 5 100195 - Hospital B.xls 25
What I have tried to do:
As there is a pattern in Row tags
that can be captured with a Regex, I created an intermediary table with a boolean column which distinguishes the entries that are numbers with those that end with .xls.
Row tags Values Regex pattern
0 4 100 False
1 100101 - Hospital A.xls 30 True
2 100195 - Hospital B.xls 30 True
3 100105 - Hospital C.xls 40 True
4 5 50 False
5 100101 - Hospital A.xls 25 True
6 100195 - Hospital B.xls 25 True
But I'm stuck here.
import pandas as pd
data = {
'Row tags': ['4', '100101 - Hospital A.xls', '100195 - Hospital B.xls', '100105 - Hospital C.xls', '5', '100101 - Hospital A.xls', '100195 - Hospital B.xls'],
'Values': [100, 30, 30, 40, 50, 25, 25]
}
df = pd.DataFrame(data)
m = df['Row tags'].str.contains('.xls')
df = df.assign( Code = df['Row tags'].where(m == False).ffill() )[m]
print(df[['Code', 'Row tags', 'Values']])
Code Row tags Values
1 4 100101 - Hospital A.xls 30
2 4 100195 - Hospital B.xls 30
3 4 100105 - Hospital C.xls 40
5 5 100101 - Hospital A.xls 25
6 5 100195 - Hospital B.xls 25