Imagine I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({"a":["","DATE","01-01-2012"],
"b":["","ID",18],
"c":["CLASS A","GOLF",3],
"d":["","HOCKEY",4],
"e":["","BASEBALL",2],
"f":["CLASS B","GOLF",15],
"g":["","HOCKEY",2],
"h":["","BASEBALL",3]
})
Out[33]:
a b c d e f g h
0 CLASS A CLASS B
1 DATE ID GOLF HOCKEY BASEBALL GOLF HOCKEY BASEBALL
2 01-01-2012 18 3 4 2 15 2 3
I would like to add the strings in the first row to the names of those sports on the row below, but only before the beginning of the next "Class". Does anyone know how can I do that?
So the result should be like this:
a b c ... f g h
0 CLASS A ... CLASS B
1 DATE ID CLASS A GOLF ... CLASS B GOLF CLASS B HOCKEY CLASS B BASEBALL
2 01-01-2012 18 3 ... 15 2 3
Later I will make the row 1 to be my header names, but this part I know how to do. I already tried to use df.iterrows but I got confused with the workflow.
Using replace
+ffill
to forward the CLASS, and a boolean mask to change the strings by boolean indexing:
s = df.loc[0].replace('', np.nan).ffill()
m = s.notna()
df.loc[1, m] = s[m]+' '+df.loc[1, m]
Output:
a b c d e f g h
0 CLASS A CLASS B
1 DATE ID CLASS A GOLF CLASS A HOCKEY CLASS A BASEBALL CLASS B GOLF CLASS B HOCKEY CLASS B BASEBALL
2 01-01-2012 18 3 4 2 15 2 3
Side note: it might be better to have those two rows as a MultiIndex rather than rows with strings. This would enable you to benefit from vectorized operations on your numeric data.