Search code examples
pythonpandascategoriesone-hot-encodingdummy-variable

How to make dummy coding (pd.get_dummies()) only for categories which share in nominal variables is at least 40% in Python Pandas?


I have DataFrame like below:

COL1 | COL2 | COL3 | ...  | COLn
-----|------|------|------|----
111  | A    | Y    | ...  | ...
222  | A    | Y    | ...  | ...
333  | B    | Z    | ...  | ...
444  | C    | Z    | ...  | ...
555  | D    | P    | ...  | ...

And I need to do dummy coding (pandas.get_dummies()) only for categories which share in variable is at least 40%.

So, in COL2 only category "A" share in variable is at least 40%, in COL3 categories "Y" and "Z" share is at least 40% in variable.

So, as a result I need output like below:

COL1   | COL2_A  | COL3_Y  | COL_Z | ...  | COLn
-------|---------|---------|-------|------|-------
111    | 1       | 1       | 0     | ...  | ...
222    | 1       | 1       | 0     | ...  | ...
333    | 0       | 0       | 1     | ...  | ...
444    | 0       | 0       | 1     | ...  | ...
555    | 0       | 0       | 0     | ...  | ...

reproducible example of source df below:

df = pd.DataFrame()
df["COL1"] = [111,222,333,444,555]
df["COL2"] = ["A", "A", "B", "C", "D"]
df["COL3"] = ["Y", "Y", "Z", "Z", "P"]
df

Solution

  • Make dummy variables and remove some of them whose occurence ratio is smaller than 0.4 as follows.

    import pandas as pd
    
    df = pd.DataFrame()
    df["COL1"] = [111,222,333,444,555]
    df["COL2"] = ["A", "A", "B", "C", "D"]
    df["COL3"] = ["Y", "Y", "Z", "Z", "P"]
    
    origin_cols = df.columns
    df = pd.get_dummies(df)
    dummy_vars = [col for col in df.columns if (col.split('_')[0] in origin_cols) and ('_' in col)]
    
    cond = df[dummy_vars].sum() / len(df) < 0.4
    useless_cols = cond[cond].index
    df = df.drop(useless_cols, axis = 1)