python pandas dataframe frequency categorical-data

Binarize a lot of features with condition

I have Pandas dataframe with hundreds of categoric features (in numbers). I want to leave only top values in columns. I do already know, that there are only 3 or 4 most frequent values in each column, but I want to select it automatically. I need two ways to do it:

1)leave only 3 most frequent values. Notion: there are no columns with 1, 2 or 3 unique values (~20 unique values in each column), so, do not consider it. If you have, for example, several third places, leave them all. For example:

#after you use value_counts() column 1
1 35
2 23
3 10
4 9
8 8
6 8

#after you use value_counts() on column 2
0 23
2 15
1 15 #two second places
4 9
5 3
6 2

#result after you use value_counts() on column 1
1 35
2 23
3 10
others 25 #9+8+8

#result after you use value_counts() on column 2
0 23
2 15
1 15
4 9
others 5 #3+2

2)leave as many values in each column as needed so that the number of remaining values is less than the number of the last values that you decided to leave. For example:

#after you use value_counts() column 1
1 35
2 23
3 10
4 3
8 2
6 1

#after you use value_counts() on column 2
0 23
2 15
1 9
4 8
5 3
6 2

#result after you use value_counts() on column 1
1 35
2 23
3 10
others 6 #3+2+1

#result after you use value_counts() on column 2
0 23
2 15
1 9
4 8
others 5 #3+2

Please, do both. Thanks.

Solution

I will showcase what I'd like to use myself at work with a 2-columned data. Limitation: simultaneous ties in the 2nd, 3rd, and 4th places are not collected into the same cell in this solution. You may have to further customize this behavior depending on your purpose.

Sample Data

There are 2 columns with 26 classes each. One column is categorical and the other one is numeric. The sample data is chosen deliberately to showcase the effect of ties.

import pandas as pd
import numpy as np

np.random.seed(2)  # reproducibility
df = pd.DataFrame(np.random.randint(65, 91, (1000, 2)), columns=["str", "num"])
df["str"] = list(map(chr, df["str"].values))

print(df)
    str  num
0     I   80
1     N   73
2     W   76
3     S   76
4     I   72
..   ..  ...
995   M   80
996   Q   70
997   P   66
998   I   87
999   F   83
[1000 rows x 2 columns]

Desired Function

def count_top_n(df, n_top):

    # name of output columns
    def gen_cols(ls_str):
        for s in ls_str:
            yield s
            yield f"{s}_counts"

    df_count = pd.DataFrame(np.zeros((n_top+1, df.shape[1]*2), dtype=object),
                            index=range(1, n_top+2),
                            columns=list(gen_cols(df.columns.values)))  # df.shape[1] = #cols
    # process each column
    for i, col in enumerate(df):
        # count
        tmp = df[col].value_counts()
        assert len(tmp) > n_top, f"ValueError: too few classes {len(tmp)} <= {n_top} = n_top)"

        # case 1: no ties at the 3rd place
        if tmp.iat[n_top - 1] != tmp.iat[n_top]:
            # fill in classes
            df_count.iloc[:n_top, 2*i] = tmp[:n_top].index.values
            df_count.iloc[n_top, 2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top, 2*i+1] = tmp[:n_top]
            df_count.iloc[n_top, 2*i+1] = tmp[n_top:].sum()
        
        # case 2: ties
        else:
            # new termination location
            n_top_new = (tmp >= tmp.iat[n_top]).sum()
            # fill in classes
            df_count.iloc[:n_top-1, 2*i] = tmp.iloc[:n_top-1].index.values
            df_count.iloc[n_top-1, 2*i] = list(tmp.iloc[n_top-1:n_top_new].index.values)
            df_count.iloc[n_top, 2*i] = "(rest)"
            # fill counts
            df_count.iloc[:n_top-1, 2*i+1] = tmp.iloc[:n_top-1].values
            df_count.iloc[n_top-1, 2*i+1] = list(tmp.iloc[n_top-1:n_top_new].values)
            df_count.iloc[n_top, 2*i+1] = tmp.iloc[n_top_new:].values.sum()

    return df_count

Output:

A human-readable table is generated. Note that there are ties in the 2nd,3rd,and 4th places for the column str.

print(count_top_n(df, 3))
      str str_count       num num_count
1       V        52        71        51
2       Q        46        86        47
3  [B, K]  [46, 46]  [90, 67]  [46, 46]
4  (rest)       810    (rest)       810