Fill missing data with random values from categorical column - Python

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.

My code is:

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

results

result is

The first 3 rows was nan but now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?

UPDATE: Thanks ti7 helped me solved the problem:

new_agent = hotel['agent'].dropna() #get a series of just the available values

n_null = hotel['agent'].isnull().sum() #length of the missing entries

new_agent.sample(n_null,replace=True).values #sample it with repetition and get values

hotel.loc[hotel['agent'].isnull(),'agent']=new_agent.sample(n_null,replace=True).values #fill and replace

Solution

.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!

You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.

get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign

quick code

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index

demo

>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7

>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])

>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
...     df["agent"].isna().sum(),
...     replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7