python pandas data-cleaning

How to clean a dataframe column filled with names using Python?


I have the following dataframe:

import pandas as pd

df = pd.DataFrame(columns=['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']

I want to clean the column in order to achieve the following:

df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df

Cleaned names are based on the following reference table:

ref = pd.DataFrame( columns = ['Cleaned Names']) 
ref['Cleaned Names'] = ['adam','beth']

I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.


Solution

  • You can try:

    lst=['adam','beth']
    out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x})  for x in lst],axis=1)
    df['Name corrected']=out.bfill(axis=1).iloc[:,0]
    #Finally:
    df['Name corrected']=df['Name corrected'].ffill()
    # Caution: ffill() copies the value from the row above, so it gives wrong values when an unmatched name follows a row for a different person
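
The ffill() caveat can be seen with a small sketch. The three-row frame below is made up for illustration (it is not the asker's data): the misspelling 'beht' is placed directly after an 'Adam.' row, so it picks up the wrong name.

```python
import pandas as pd

# Hypothetical row order: 'beht' sits right after a row for a different person
df = pd.DataFrame({'Name': ['Adam.', 'beht', 'Beth']})

lst = ['adam', 'beth']
# One column per reference name: the name where it is a substring, NaN elsewhere
out = pd.concat(
    [df['Name'].str.contains(x, case=False).map({True: x}) for x in lst],
    axis=1,
)
# First matching name per row, then fill the unmatched rows from the row above
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0].ffill()

print(df)
```

Here 'beht' contains neither 'adam' nor 'beth', so it is NaN after the bfill step and ffill() copies 'adam' from the row above. The data in the question avoids this only because every unmatched spelling follows a row for the same person.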
    

    Explanation:

    lst=['adam','beth']
    # Create the list of reference names
    out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x})  for x in lst],axis=1)
    # For each reference name, check whether the 'Name' column contains it (ignoring case).
    # This gives a boolean Series; map({True:x}) turns True into the reference name and False into NaN.
    # Concatenating the resulting Series on axis=1 produces a DataFrame with one column per reference name.
    df['Name corrected']=out.bfill(axis=1).iloc[:,0]
    # Backward-fill along axis=1 and take the first column, i.e. the first reference name that matched in each row
    #Finally:
    df['Name corrected']=df['Name corrected'].ffill()
    # Forward-fill the remaining missing values (rows where no reference name matched) from the row above
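
Since ffill() depends on row order, one alternative is the fuzzy matching mentioned in the question. The following is a minimal sketch, not a tuned solution, using difflib.get_close_matches from the standard library. The helper name closest_name and the 0.6 cutoff (difflib's default) are assumptions chosen for illustration.

```python
import difflib
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.',
                            'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']})
ref = pd.DataFrame({'Cleaned Names': ['adam', 'beth']})


def closest_name(raw, choices, cutoff=0.6):
    """Hypothetical helper: return the most similar reference name, or None."""
    # Normalise first: lower-case and keep letters only ('Adam.' -> 'adam')
    cleaned = ''.join(ch for ch in raw.lower() if ch.isalpha())
    # get_close_matches returns up to n candidates scoring at least cutoff
    matches = difflib.get_close_matches(cleaned, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


choices = ref['Cleaned Names'].tolist()
# Each row is matched on its own, so the result does not depend on row order
df['Name corrected'] = df['Name'].map(lambda name: closest_name(name, choices))

print(df)
```

Every name is compared with every reference name, so the cost grows with len(df) * len(ref). Names that score below the cutoff come back as None rather than being guessed, which makes unmatched rows easy to inspect.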