python pandas dataframe dataset str-replace

Randomly replacing a specific value in a dataset with frac in pandas


I've got a dataset in which missing values appear as " ?" in just one column (Feature1). I want to replace all of those missing values with other values from that column. I started like this:

Feature1_value_counts = df.Feature1.value_counts(normalize=True)

The code above gives me the numbers I can use for frac in pandas. Feature1 contains 15 unique values, so it returns 15 numbers (all proportions between 0 and 1).
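For illustration, here is a minimal sketch of what `value_counts(normalize=True)` returns, using a small made-up Series (not the real dataset; the values "a" and "b" are placeholders). It also shows why the " ?" entries should be excluded before the frequencies are computed: otherwise " ?" receives a share of its own.

```python
import pandas as pd

# Made-up sample data standing in for Feature1.
s = pd.Series(["a", "b", "a", " ?", "a", "b", " ?", "a", " ?"], name="Feature1")

# Computed on the raw column, " ?" is counted like any other value.
with_placeholder = s.value_counts(normalize=True)

# Computed without the placeholder, the shares of the real values sum to 1
# and can be used directly as sampling weights.
weights = s[s != " ?"].value_counts(normalize=True)

print(with_placeholder)
print(weights)
```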

Now I just need to randomly replace the " ?" entries with those unique values (all strings), using those proportions as probabilities.

I don't know how to do this using pandas!

I've tried loc and iloc, and also some for loops and if statements, but I couldn't get there.


Solution

  • You can take advantage of the p parameter of numpy.random.choice:

    import numpy as np
    import pandas as pd
    
    # make sure missing values are real NaNs
    # (the placeholder in the question is ' ?', with a leading space)
    df['Feature1'] = df['Feature1'].replace(' ?', np.nan)
    
    # relative frequency of each non-NaN value (value_counts skips NaN by default)
    counts = df['Feature1'].value_counts(normalize=True)
    # identify the rows with NaNs
    m = df['Feature1'].isna()
    
    # replace the NaNs with random values, using the frequencies as weights
    df.loc[m, 'Feature1'] = np.random.choice(counts.index, p=counts, size=m.sum())
    
    print(df)
    

    Example output (the draw is random, so results vary between runs; replaced values shown in uppercase for clarity):

      Feature1
    0        a
    1        b
    2        a
    3        A
    4        a
    5        b
    6        B
    7        a
    8        A
    

    Used input:

    df = pd.DataFrame({'Feature1': ['a', 'b', 'a', np.nan, 'a', 'b', np.nan, 'a', np.nan]})
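If repeatable results are needed, the same idea can be wrapped in a small helper that uses a seeded NumPy `Generator`. This is only a sketch under some assumptions: the function name `fill_missing_by_frequency` and its `placeholder` and `seed` parameters are made up for illustration, and the sample data is the toy input above with " ?" in place of NaN.

```python
import numpy as np
import pandas as pd


def fill_missing_by_frequency(df, column, placeholder=" ?", seed=None):
    """Return a copy of df where placeholder/NaN entries in `column` are
    replaced by values drawn at random, weighted by observed frequencies.

    Hypothetical helper; name and parameters are illustrative only.
    """
    out = df.copy()
    # Treat the placeholder string as a real missing value.
    out[column] = out[column].replace(placeholder, np.nan)
    mask = out[column].isna()
    if not mask.any():
        return out  # nothing to fill
    # value_counts skips NaN by default, so the weights sum to 1.
    weights = out[column].value_counts(normalize=True)
    rng = np.random.default_rng(seed)
    out.loc[mask, column] = rng.choice(
        weights.index.to_numpy(), size=int(mask.sum()), p=weights.to_numpy()
    )
    return out


df = pd.DataFrame({"Feature1": ["a", "b", "a", " ?", "a", "b", " ?", "a", " ?"]})
filled = fill_missing_by_frequency(df, "Feature1", seed=0)
print(filled)
```

Passing the same `seed` gives the same fill on every run, while `seed=None` behaves like the unseeded version. The helper works on a copy, so the original `df` keeps its " ?" entries.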