Search code examples
pythonpandasdataframefrequencycategorical-data

Binarize a lot of features with condition


I have Pandas dataframe with hundreds of categoric features (in numbers). I want to leave only top values in columns. I do already know, that there are only 3 or 4 most frequent values in each column, but I want to select it automatically. I need two ways to do it:

1)leave only 3 most frequent values. Notion: there are no columns with 1, 2 or 3 unique values (~20 unique values in each column), so, do not consider it. If you have, for example, several third places, leave them all. For example:

#after you use value_counts() column 1
1         35
2         23
3         10
4         9
8         8
6         8

#after you use value_counts() on column 2
0         23
2         15
1         15 #two second places
4         9
5         3
6         2

#result after you use value_counts() on column 1
1         35
2         23
3         10
others  25 #9+8+8

#result after you use value_counts() on column 2
0         23
2         15
1         15
4         9
others 5 #3+2

2)leave as many values in each column as needed so that the number of remaining values is less than the number of the last values that you decided to leave. For example:

#after you use value_counts() column 1
1         35
2         23
3         10
4         3
8         2
6         1

#after you use value_counts() on column 2
0         23
2         15
1         9
4         8
5         3
6         2

#result after you use value_counts() on column 1
1         35
2         23
3         10
others  6 #3+2+1

#result after you use value_counts() on column 2
0         23
2         15
1         9
4         8
others 5 #3+2

Please, do both. Thanks.


Solution

  • I will showcase what I'd like to use myself at work with a 2-columned data. Limitation: simultaneous ties in the 2nd, 3rd, and 4th places are not collected into the same cell in this solution. You may have to further customize this behavior depending on your purpose.

    Sample Data

    There are 2 columns with 26 classes each. One column is categorical and the other one is numeric. The sample data is chosen deliberately to showcase the effect of ties.

    import pandas as pd
    import numpy as np
    
    np.random.seed(2)  # reproducibility
    df = pd.DataFrame(np.random.randint(65, 91, (1000, 2)), columns=["str", "num"])
    df["str"] = list(map(chr, df["str"].values))
    
    print(df)
        str  num
    0     I   80
    1     N   73
    2     W   76
    3     S   76
    4     I   72
    ..   ..  ...
    995   M   80
    996   Q   70
    997   P   66
    998   I   87
    999   F   83
    [1000 rows x 2 columns]
    

    Desired Function

    def count_top_n(df, n_top):
    
        # name of output columns
        def gen_cols(ls_str):
            for s in ls_str:
                yield s
                yield f"{s}_counts"
    
        df_count = pd.DataFrame(np.zeros((n_top+1, df.shape[1]*2), dtype=object),
                                index=range(1, n_top+2),
                                columns=list(gen_cols(df.columns.values)))  # df.shape[1] = #cols
        # process each column
        for i, col in enumerate(df):
            # count
            tmp = df[col].value_counts()
            assert len(tmp) > n_top, f"ValueError: too few classes {len(tmp)} <= {n_top} = n_top)"
    
            # case 1: no ties at the 3rd place
            if tmp.iat[n_top - 1] != tmp.iat[n_top]:
                # fill in classes
                df_count.iloc[:n_top, 2*i] = tmp[:n_top].index.values
                df_count.iloc[n_top, 2*i] = "(rest)"
                # fill counts
                df_count.iloc[:n_top, 2*i+1] = tmp[:n_top]
                df_count.iloc[n_top, 2*i+1] = tmp[n_top:].sum()
            
            # case 2: ties
            else:
                # new termination location
                n_top_new = (tmp >= tmp.iat[n_top]).sum()
                # fill in classes
                df_count.iloc[:n_top-1, 2*i] = tmp.iloc[:n_top-1].index.values
                df_count.iloc[n_top-1, 2*i] = list(tmp.iloc[n_top-1:n_top_new].index.values)
                df_count.iloc[n_top, 2*i] = "(rest)"
                # fill counts
                df_count.iloc[:n_top-1, 2*i+1] = tmp.iloc[:n_top-1].values
                df_count.iloc[n_top-1, 2*i+1] = list(tmp.iloc[n_top-1:n_top_new].values)
                df_count.iloc[n_top, 2*i+1] = tmp.iloc[n_top_new:].values.sum()
    
        return df_count
    

    Output:

    A human-readable table is generated. Note that there are ties in the 2nd,3rd,and 4th places for the column str.

    print(count_top_n(df, 3))
          str str_count       num num_count
    1       V        52        71        51
    2       Q        46        86        47
    3  [B, K]  [46, 46]  [90, 67]  [46, 46]
    4  (rest)       810    (rest)       810