Search code examples
pythonpandasdata-cleaning

Pandas data cleansing to assign records below a row are assigned a particular value


In the below code, I'm using simple data manipulation to split the columns and remove unnecessary characters.

input_uni_towns = pd.read_fwf("university_towns.txt", sep = " ", header = None)
uni_towns = input_uni_towns.rename(columns={0: "Raw_data"})
uni_towns['Cleaned'] = uni_towns["Raw_data"].replace(regex=True,to_replace=[r'\[[^()]*\]'],value=r'')
uni_towns[["State","University"]] = uni_towns.Cleaned.str.split("(",n=1,expand=True) 
uni_towns["University"] = uni_towns["University"].str.rstrip(')')
cleaned_uni_towns = uni_towns[["State","University"]]

After this above step, I want to assign State to records whose above record has None assigned to it. For Ex: Auburn (Auburn University) current State is Auburn, but I want this to be updated to Alabama and similarly for records below Alabama till code encounters next State i.e. Alaska

This is the current output Current Output

This is the expected output

Expected  Output


Solution

  • You can keep put NaN in State when University is not None (i.e. that's a row with a university), then fill the NaN in State with the most recent non-NaN value.

    df = pd.DataFrame({'Cleaned': ['Alabama', 'Auburn (Auburn University)', 'Alaska']})
    df[['State', 'University']] = df.Cleaned.str.split('(', n = 1, expand = True)
    df.University = df.University.str.rstrip(')')
    
    df.State = np.where(df.University.map(lambda u: u is None), df.State, np.nan)
    df.State = df.State.fillna(method = 'ffill')