I have Pandas dataframe with hundreds of categoric features (in numbers). I want to leave only top values in columns. I do already know, that there are only 3 or 4 most frequent values in each column, but I want to select it automatically. I need two ways to do it:
1)leave only 3 most frequent values. Notion: there are no columns with 1, 2 or 3 unique values (~20 unique values in each column), so, do not consider it. If you have, for example, several third places, leave them all. For example:
#after you use value_counts() column 1
1 35
2 23
3 10
4 9
8 8
6 8
#after you use value_counts() on column 2
0 23
2 15
1 15 #two second places
4 9
5 3
6 2
#result after you use value_counts() on column 1
1 35
2 23
3 10
others 25 #9+8+8
#result after you use value_counts() on column 2
0 23
2 15
1 15
4 9
others 5 #3+2
2)leave as many values in each column as needed so that the number of remaining values is less than the number of the last values that you decided to leave. For example:
#after you use value_counts() column 1
1 35
2 23
3 10
4 3
8 2
6 1
#after you use value_counts() on column 2
0 23
2 15
1 9
4 8
5 3
6 2
#result after you use value_counts() on column 1
1 35
2 23
3 10
others 6 #3+2+1
#result after you use value_counts() on column 2
0 23
2 15
1 9
4 8
others 5 #3+2
Please, do both. Thanks.
I will showcase what I'd like to use myself at work with a 2-columned data. Limitation: simultaneous ties in the 2nd, 3rd, and 4th places are not collected into the same cell in this solution. You may have to further customize this behavior depending on your purpose.
There are 2 columns with 26 classes each. One column is categorical and the other one is numeric. The sample data is chosen deliberately to showcase the effect of ties.
import pandas as pd
import numpy as np
np.random.seed(2) # reproducibility
df = pd.DataFrame(np.random.randint(65, 91, (1000, 2)), columns=["str", "num"])
df["str"] = list(map(chr, df["str"].values))
print(df)
str num
0 I 80
1 N 73
2 W 76
3 S 76
4 I 72
.. .. ...
995 M 80
996 Q 70
997 P 66
998 I 87
999 F 83
[1000 rows x 2 columns]
def count_top_n(df, n_top):
# name of output columns
def gen_cols(ls_str):
for s in ls_str:
yield s
yield f"{s}_counts"
df_count = pd.DataFrame(np.zeros((n_top+1, df.shape[1]*2), dtype=object),
index=range(1, n_top+2),
columns=list(gen_cols(df.columns.values))) # df.shape[1] = #cols
# process each column
for i, col in enumerate(df):
# count
tmp = df[col].value_counts()
assert len(tmp) > n_top, f"ValueError: too few classes {len(tmp)} <= {n_top} = n_top)"
# case 1: no ties at the 3rd place
if tmp.iat[n_top - 1] != tmp.iat[n_top]:
# fill in classes
df_count.iloc[:n_top, 2*i] = tmp[:n_top].index.values
df_count.iloc[n_top, 2*i] = "(rest)"
# fill counts
df_count.iloc[:n_top, 2*i+1] = tmp[:n_top]
df_count.iloc[n_top, 2*i+1] = tmp[n_top:].sum()
# case 2: ties
else:
# new termination location
n_top_new = (tmp >= tmp.iat[n_top]).sum()
# fill in classes
df_count.iloc[:n_top-1, 2*i] = tmp.iloc[:n_top-1].index.values
df_count.iloc[n_top-1, 2*i] = list(tmp.iloc[n_top-1:n_top_new].index.values)
df_count.iloc[n_top, 2*i] = "(rest)"
# fill counts
df_count.iloc[:n_top-1, 2*i+1] = tmp.iloc[:n_top-1].values
df_count.iloc[n_top-1, 2*i+1] = list(tmp.iloc[n_top-1:n_top_new].values)
df_count.iloc[n_top, 2*i+1] = tmp.iloc[n_top_new:].values.sum()
return df_count
Output:
A human-readable table is generated. Note that there are ties in the 2nd,3rd,and 4th places for the column str
.
print(count_top_n(df, 3))
str str_count num num_count
1 V 52 71 51
2 Q 46 86 47
3 [B, K] [46, 46] [90, 67] [46, 46]
4 (rest) 810 (rest) 810