pandas dataframe apply nan fillna

Why does the pandas fillna function turn non-empty values into empty values?


I'm trying to fill empty values with the most frequent value in each group after grouping the DataFrame. Here is my code:

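import numpy as np
import pandas as pd

def fill_with_maxcount(x):
    try:
        # value_counts() sorts by frequency, so the first label is the most common value
        return x.value_counts().index.tolist()[0]
    except Exception as e:
        return np.nan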


df_all["Surname"] = df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x : x.fillna(fill_with_maxcount(x)))

If an error occurred inside the try block, the function would return NaN. However, I also tried logging the error inside fill_with_maxcount, and no exception was raised.

Before running this code there are 294 NaN values. After running it the count has increased to 857, which means non-empty values have been turned into NaN. I can't figure out why. I did some experiments with print statements, and the function returns a non-empty value (a string). So the problem seems to be with pandas' apply or fillna. But I have used this same method in other places without any problem.

Can someone give me a suggestion? Thank you.


Solution

  • Finally found it after some testing with the code.

     df_all.groupby(['HomePlanet','CryoSleep','Destination']).Surname.apply(lambda x : x.fillna(fill_with_maxcount(x)))
    

    The expression above returns a Series with the filled values. However, groupby drops rows with NaN keys by default, so rows where any of the grouping columns is empty are not placed in any group. The function is never applied to them, and their indexes are missing from the result. When that Series is assigned directly to the Surname column, the assignment aligns on the index, so those rows become null too.

    As the solution, I changed the code to the following.

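    import numpy as np
    import pandas as pd

    def fill_with_maxcount(x):
        try:
            # value_counts() sorts by frequency, so the first label is the most common value
            return x.value_counts().index.tolist()[0]
        except Exception as e:
            return np.nan

    def replace_only_null(x, z):
        # Rows dropped by groupby (NaN grouping key) are missing from z; reindex restores them as NaN
        z = z.reindex(x.index)
        for original, filled in zip(x, z):
            # pd.isna() catches both None and NaN; a comparison with `== np.nan` is always False
            yield filled if pd.isna(original) else original

    # group_keys=False keeps the original row labels as the index of the result
    result_1 = df_all.groupby(['HomePlanet','CryoSleep','Destination'], group_keys=False).Surname.apply(lambda x : x.fillna(fill_with_maxcount(x)))
    # Build the Series on df_all's index so the assignment below aligns row by row;
    # wrapping the list in np.array would turn the remaining NaN values into the string 'nan'
    replaced = pd.Series(list(replace_only_null(df_all.Surname, result_1)), index=df_all.index)

    df_all.Surname = replaced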
    

    The replace_only_null function compares the result with the current Surname column and replaces only the null values with the values obtained by applying fill_with_maxcount.
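
The behaviour described above can be reproduced on a small frame. The following is only a sketch under assumptions: the data is made up, a single grouping column is used for brevity, and the helper name fill_with_most_common is hypothetical. It shows that the row with a NaN grouping key is absent from the groupby result, that direct assignment wipes that row's Surname, and that passing the result to Series.fillna (which aligns on the index and leaves labels it cannot find untouched) fills only the cells that were null to begin with.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", np.nan, "Mars"],
    "Surname": ["Smith", "Smith", np.nan, "Jones", np.nan],
})


def fill_with_most_common(s):
    """Fill NaN in s with its most frequent non-null value, if there is one."""
    counts = s.value_counts()  # NaN is excluded by default
    if counts.empty:
        return s  # the group has no non-null value to fill with
    return s.fillna(counts.index[0])


# group_keys=False keeps the original row labels as the index of the result.
filled = df.groupby("HomePlanet", group_keys=False)["Surname"].apply(
    fill_with_most_common
)

# Row 3 has a NaN key, so groupby dropped it and it is absent from `filled`.
missing_rows = df.index.difference(filled.index)

# Direct assignment aligns on the index, so row 3 loses its existing Surname.
broken = df.assign(Surname=filled)

# Filling from `filled` only touches cells that were NaN in the original column.
fixed = df.assign(Surname=df["Surname"].fillna(filled))

print(missing_rows.tolist())
print(broken)
print(fixed)
```

If rows with a NaN key should instead be treated as a group of their own, groupby also accepts dropna=False (pandas 1.1 and later), which may be worth trying, although whether that grouping makes sense depends on the data.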