Search code examples
pythonpandassklearn-pandas

Pandas how update values with counts greater x


I have a pandas column that contains a lot of string that appear less than 5 times, I do not to remove these values however I do want to replace them with a placeholder string called "pruned". What is the best way to do this?

df= pd.DataFrame(['a','a','b','c'],columns=["x"])
# get value counts and set pruned I want something that does as follows
df[df[count<2]] = "pruned"

Solution

  • I suspect there is a more efficient way to do this, but simple way to do it is to build a dict of counts and then prune if those values are below the count threshold. Consider the example df:

    df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
    
        foo
    0   12
    1   11
    2   4
    3   15
    4   6
    5   12
    6   4
    7   7
    
    # make a dict with counts
    count_dict = {d:(df['foo']==d).sum() for d in df.foo.unique()}
    # assign that dict to a column
    df['bar'] = [count_dict[d] for d in df.foo]
    # loc in the 'pruned' tag
    df.loc[df.bar < 2, 'foo']='pruned'
    

    Returns as desired:

        foo bar
    0   12      2
    1   pruned  1
    2   4       2
    3   pruned  1
    4   pruned  1
    5   12      2
    6   4       2
    7   pruned  1
    

    (and of course you would change 2 to 5 and dump that bar column if you want).

    UPDATE

    Per request for an in-place version, here is a one-liner that can do it without assigning another column or creating that dict directly (and thanks @TrigonaMinima for the values_count() tip):

    df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
    print(df)
    df.foo = df.foo.apply(lambda row: 'pruned' if (df.foo.value_counts() < 2)[row] else row)
    print(df)
    

    which returns again as desired:

       foo
    0   12
    1   11
    2    4
    3   15
    4    6
    5   12
    6    4
    7    7
          foo
    0      12
    1  pruned
    2       4
    3  pruned
    4  pruned
    5      12
    6       4
    7  pruned