I have a pandas column that contains a lot of string that appear less than 5 times, I do not to remove these values however I do want to replace them with a placeholder string called "pruned". What is the best way to do this?
df= pd.DataFrame(['a','a','b','c'],columns=["x"])
# get value counts and set pruned I want something that does as follows
df[df[count<2]] = "pruned"
I suspect there is a more efficient way to do this, but simple way to do it is to build a dict of counts and then prune if those values are below the count threshold. Consider the example df
:
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
foo
0 12
1 11
2 4
3 15
4 6
5 12
6 4
7 7
# make a dict with counts
count_dict = {d:(df['foo']==d).sum() for d in df.foo.unique()}
# assign that dict to a column
df['bar'] = [count_dict[d] for d in df.foo]
# loc in the 'pruned' tag
df.loc[df.bar < 2, 'foo']='pruned'
Returns as desired:
foo bar
0 12 2
1 pruned 1
2 4 2
3 pruned 1
4 pruned 1
5 12 2
6 4 2
7 pruned 1
(and of course you would change 2 to 5 and dump that bar
column if you want).
UPDATE
Per request for an in-place version, here is a one-liner that can do it without assigning another column or creating that dict directly (and thanks @TrigonaMinima for the values_count()
tip):
df= pd.DataFrame([12,11,4,15,6,12,4,7],columns=['foo'])
print(df)
df.foo = df.foo.apply(lambda row: 'pruned' if (df.foo.value_counts() < 2)[row] else row)
print(df)
which returns again as desired:
foo
0 12
1 11
2 4
3 15
4 6
5 12
6 4
7 7
foo
0 12
1 pruned
2 4
3 pruned
4 pruned
5 12
6 4
7 pruned