python pandas

Column manipulation based on headline values within rows


I have a Pandas dataframe with a column that contains different types of values, and I want to create a new column from the information in that column. Every few rows there is a kind of "headline" row that defines the value for all following rows, until the next headline row defines the value for the rows after it, and so on.

To illustrate, here is an example:

import pandas as pd

data = {'AA': ['', '', '', 'V_525-124', 'gsdgsd', 'hdfjhdf', 'gsdhsdhsd', 'gsdgsd', 'V_535-623', 'hosdfjk', 'hjodfjh', 'hjsdfjo',  'V_563-534', 'hojhdfhjdf', 'hodfjhjdfj', 'hofoj', 'hkdfphdf']}
df = pd.DataFrame(data)

print(df)

I want to create a new column BB that should look like this:

import pandas as pd

data = {'AA': ['', '', '', 'V_525-124', 'gsdgsd', 'hdfjhdf', 'gsdhsdhsd', 'gsdgsd', 'V_535-623', 'hosdfjk', 'hjodfjh', 'hjsdfjo',  'V_563-534', 'hojhdfhjdf', 'hodfjhjdfj', 'hofoj', 'hkdfphdf'],
'BB': ['', '', '', 'V_525-124', 'V_525-124', 'V_525-124', 'V_525-124', 'V_525-124', 'V_535-623', 'V_535-623', 'V_535-623', 'V_535-623',  'V_563-534', 'V_563-534', 'V_563-534', 'V_563-534', 'V_563-534']}
df = pd.DataFrame(data)

print(df)

The number of rows under each "headline" varies, so the script should check whether a row is a headline, write that headline value to column BB, and keep writing it for each row below until a new headline is detected. I can only think of a for-loop with indices and if-statements, but I am sure Pandas offers a more elegant solution.

The "headlines" all start with 'V_' if that helps.


Solution

  • You can use where and ffill (forward fill), with no need for loops:

    • df['AA'].where(df['AA'].str.startswith('V_')).ffill().fillna('')
      • str.startswith builds a boolean mask that is True for rows where column AA starts with 'V_'.
      • where keeps the values of the headline rows and sets all other rows to NaN.
      • ffill forward fills the last valid headline value down the column until the next headline is reached.
      • fillna('') replaces the remaining NaN values (the rows before the first headline) with empty strings.
    import pandas as pd
    
    data = {'AA': ['', '', '', 'V_525-124', 'gsdgsd', 'hdfjhdf', 'gsdhsdhsd', 'gsdgsd', 'V_535-623', 'hosdfjk', 'hjodfjh', 'hjsdfjo',  'V_563-534', 'hojhdfhjdf', 'hodfjhjdfj', 'hofoj', 'hkdfphdf']}
    df = pd.DataFrame(data)
    
    df['BB'] = df['AA'].where(df['AA'].str.startswith('V_')).ffill().fillna('')
      
    print(df)
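
To make the chain easier to follow, here is a minimal sketch that splits it into its four steps and keeps each intermediate result in its own variable. The data is a shortened version of the example above, plus one hypothetical missing value (None) that is not in the original data. The variable names (is_headline, headlines_only, filled) are illustrative only, and na=False is an optional addition so that missing values count as non-headline rows in the mask.

```python
import pandas as pd

# Shortened sample data; the trailing None is a hypothetical missing value.
df = pd.DataFrame({'AA': ['', 'V_525-124', 'gsdgsd', 'hdfjhdf',
                          'V_535-623', 'hosdfjk', None]})

# Step 1: boolean mask marking headline rows.
# na=False makes missing values count as "not a headline".
is_headline = df['AA'].str.startswith('V_', na=False)

# Step 2: keep headline values, replace everything else with NaN.
headlines_only = df['AA'].where(is_headline)

# Step 3: carry each headline down to the rows below it.
filled = headlines_only.ffill()

# Step 4: rows before the first headline are still NaN; use '' instead.
df['BB'] = filled.fillna('')

print(df)
```

Printing is_headline, headlines_only and filled one after another shows how each step changes the column, which can help when the headline pattern is more complicated than a fixed prefix.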