Delete columns with extremely unequally distributed values from pandas dataframe

Given the following code:

import numpy as np
import pandas as pd

arr = np.array([
#    1,2,3,4,5,6 <- Number of the columns.
df = pd.DataFrame(arr)

# Print how often each value occurs in each column.
for col in df.columns:
    print({x: list(df[col]).count(x) for x in set(df[col])})

I want to delete from the dataframe all the columns in which one value occurs more often than all the other values of the column together. In this case I would like to drop the columns 4 and 6 (see comment) since the number 1 occurs more often than all the other numbers in these columns together (6 > 2 in column 4 and 7 > 1 in column 6). I don't want to drop the first column (4 = 4). How would I do that?


  • Another option is to call value_counts on each column and keep the column only if its highest count is less than or equal to half the number of rows in the data frame:

    df.loc[:, df.apply(lambda col: max(col.value_counts()) <= df.shape[0]/2)]
    #   0   1   2   4
    #0  1   2   9   1
    #1  2   3   3   0
    #2  1   4   2   2
    #3  2   3   1   2
    #4  1   2   3   8
    #5  2   2   5   1
    #6  1   3   8   4
    #7  2   4   7   3
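
For anyone who wants to run this end to end, here is a minimal sketch of the same idea. The array from the question is not shown in full, so columns 0, 1, 2 and 4 are copied from the output above, while columns 3 and 5 are made-up values chosen only to match the counts described in the question (the value 1 occurring 6 and 7 times). The helper name `drop_dominated_columns` is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    0: [1, 2, 1, 2, 1, 2, 1, 2],
    1: [2, 3, 4, 3, 2, 2, 3, 4],
    2: [9, 3, 2, 1, 3, 5, 8, 7],
    3: [1, 1, 1, 5, 1, 1, 7, 1],  # assumed data: 1 occurs 6 times out of 8
    4: [1, 0, 2, 2, 8, 1, 4, 3],
    5: [1, 1, 1, 1, 1, 1, 3, 1],  # assumed data: 1 occurs 7 times out of 8
})


def drop_dominated_columns(frame):
    """Drop columns where one value occurs more often than all others combined."""
    half = len(frame) / 2
    # For each column, True if its most frequent value fills at most half the rows.
    keep = frame.apply(lambda col: col.value_counts().max() <= half)
    return frame.loc[:, keep]


result = drop_dominated_columns(df)
print(list(result.columns))
```

Column 0 survives because its top count (4) equals exactly half of the 8 rows, which matches the "4 = 4" case in the question. Note that `value_counts` skips NaN by default, so columns with missing values would need separate handling.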