I have a Pandas dataframe with a column that contains different types of values, and I want to create a new column based on the information in that column. Every few rows there is a kind of "headline" row that should define the values for the following rows, until the next headline row defines the values for the rows after it, and so on.
To understand better, here is an example:
import pandas as pd
data = {'AA': ['', '', '', 'V_525-124', 'gsdgsd', 'hdfjhdf', 'gsdhsdhsd', 'gsdgsd', 'V_535-623', 'hosdfjk', 'hjodfjh', 'hjsdfjo', 'V_563-534', 'hojhdfhjdf', 'hodfjhjdfj', 'hofoj', 'hkdfphdf']}
df = pd.DataFrame(data)
print(df)
I want to create a new column BB that should look like this:
import pandas as pd
data = {'AA': ['', '', '', 'V_525-124', 'gsdgsd', 'hdfjhdf', 'gsdhsdhsd', 'gsdgsd', 'V_535-623', 'hosdfjk', 'hjodfjh', 'hjsdfjo', 'V_563-534', 'hojhdfhjdf', 'hodfjhjdfj', 'hofoj', 'hkdfphdf'],
'BB': ['', '', '', 'V_525-124', 'V_525-124', 'V_525-124', 'V_525-124', 'V_525-124', 'V_535-623', 'V_535-623', 'V_535-623', 'V_535-623', 'V_563-534', 'V_563-534', 'V_563-534', 'V_563-534', 'V_563-534']}
df = pd.DataFrame(data)
print(df)
The number of rows under each "headline" varies, so the script should check whether a row is a headline, write that headline value to column BB, and keep writing it for the rows below until a new headline is detected. I can only think of a for-loop with indices and if-statements, but I am sure Pandas offers a more elegant solution.
The "headlines" all start with 'V_' if that helps.
You can use where and ffill (forward fill) without the need for loops:
df['AA'].where(df['AA'].str.startswith('V_')).ffill().fillna('')
str.startswith to identify the rows where column AA starts with 'V_'
where to keep the identified headline rows in column BB and set the other rows to NaN
ffill to forward fill the last valid headline value down the column until the next headline is identified
fillna('') to replace the remaining NaN values with empty strings
import pandas as pd
data = {'AA': ['', '', '', 'V_525-124', 'gsdgsd', 'hdfjhdf', 'gsdhsdhsd', 'gsdgsd', 'V_535-623', 'hosdfjk', 'hjodfjh', 'hjsdfjo', 'V_563-534', 'hojhdfhjdf', 'hodfjhjdfj', 'hofoj', 'hkdfphdf']}
df = pd.DataFrame(data)
df['BB'] = df['AA'].where(df['AA'].str.startswith('V_')).ffill().fillna('')
print(df)
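If it helps to see what each step in the chain does, here is a minimal sketch that splits the one-liner into its intermediate results. The shortened sample data and the variable names (is_headline, headlines_only, filled) are my own, not from the question. It also assumes column AA might contain missing values, which the original data does not. In that case passing na=False to str.startswith keeps the mask strictly boolean.

```python
import pandas as pd

# Hypothetical, shortened sample data; the None entry is an assumption
# added to show how missing values can be handled
df = pd.DataFrame({'AA': ['', None, 'V_525-124', 'gsdgsd', 'V_535-623', 'hosdfjk']})

# Step 1: boolean mask of headline rows; na=False treats missing values as non-headlines
is_headline = df['AA'].str.startswith('V_', na=False)

# Step 2: keep the headline values and set every other row to NaN
headlines_only = df['AA'].where(is_headline)

# Step 3: carry each headline down to the rows below it
filled = headlines_only.ffill()

# Step 4: rows before the first headline are still NaN; replace them with empty strings
df['BB'] = filled.fillna('')

print(df)
```

The result is the same as with the chained version. The named intermediate Series just make it easier to inspect each stage if the output is not what you expect.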