Search code examples
pythonpandasdata-cleaningfillna

Fill missing data with random values from categorical column - Python


I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.

My code is:

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

results

result is

The first 3 rows was nan but now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?

UPDATE: Thanks ti7 helped me solved the problem:

new_agent = hotel['agent'].dropna() #get a series of just the available values

n_null = hotel['agent'].isnull().sum() #length of the missing entries

new_agent.sample(n_null,replace=True).values #sample it with repetition and get values

hotel.loc[hotel['agent'].isnull(),'agent']=new_agent.sample(n_null,replace=True).values #fill and replace


Solution

  • .fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!

    You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.

    • get a Series of just the available values (.dropna())
    • .sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
    • get the .values (this is a flat numpy array)
    • filter the column and assign

    quick code

    df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
        df["agent"].isna().sum(),  # get the same number of values as are missing
        replace=True               # repeat values
    ).values                       # throw out the index
    

    demo

    >>> import pandas as pd
    >>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
    >>> df
       agent  b
    0    1.0  3
    1    2.0  4
    2    NaN  5
    3    NaN  6
    4   10.0  7
    
    >>> df["agent"].isna().sum()
    2
    >>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
    array([2., 1.])
    >>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
    array([2., 2.])
    
    >>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    ...     df["agent"].isna().sum(),
    ...     replace=True
    ... ).values
    >>> df
       agent  b
    0    1.0  3
    1    2.0  4
    2   10.0  5
    3    2.0  6
    4   10.0  7