Given the following code:
import numpy as np
import pandas as pd
arr = np.array([
[1,2,9,1,1,1],
[2,3,3,1,0,1],
[1,4,2,1,2,1],
[2,3,1,1,2,1],
[1,2,3,1,8,1],
[2,2,5,1,1,1],
[1,3,8,7,4,1],
[2,4,7,8,3,3]
])
# 1,2,3,4,5,6 <- Number of the columns.
df = pd.DataFrame(arr)
for col in df.columns:
    # Count how often each distinct value occurs in this column.
    print({x: list(df[col]).count(x) for x in set(df[col])})
I want to delete from the dataframe all the columns in which one value occurs more often than all the other values of the column together. In this case I would like to drop the columns 4 and 6 (see comment) since the number 1 occurs more often than all the other numbers in these columns together (6 > 2 in column 4 and 7 > 1 in column 6). I don't want to drop the first column (4 = 4). How would I do that?
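Another option is to call value_counts on each column. One value outnumbers all the others combined exactly when its count is greater than half the number of rows, so keep a column only if its maximum count is less than or equal to half the number of rows of the data frame: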
df.loc[:, df.apply(lambda col: max(col.value_counts()) <= df.shape[0]/2)]
# 0 1 2 4
#0 1 2 9 1
#1 2 3 3 0
#2 1 4 2 2
#3 2 3 1 2
#4 1 2 3 8
#5 2 2 5 1
#6 1 3 8 4
#7 2 4 7 3
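For reuse, the same check can be wrapped in a small function. This is only a sketch of the approach above: the name drop_majority_columns is made up for illustration, and it relies on value_counts ignoring NaN by default, so missing values never count as the dominant value.

```python
import numpy as np
import pandas as pd


def drop_majority_columns(df):
    """Drop columns in which one value fills more than half of the rows.

    Hypothetical helper name, not part of pandas.
    """
    n_rows = len(df)
    # Highest count of any single value in each column (NaN is not counted).
    top_counts = df.apply(lambda col: col.value_counts().max())
    # Keep a column only if its most frequent value does not outnumber the rest.
    keep = top_counts <= n_rows / 2
    return df.loc[:, keep]


arr = np.array([
    [1, 2, 9, 1, 1, 1],
    [2, 3, 3, 1, 0, 1],
    [1, 4, 2, 1, 2, 1],
    [2, 3, 1, 1, 2, 1],
    [1, 2, 3, 1, 8, 1],
    [2, 2, 5, 1, 1, 1],
    [1, 3, 8, 7, 4, 1],
    [2, 4, 7, 8, 3, 3],
])
df = pd.DataFrame(arr)

result = drop_majority_columns(df)
print(result.columns.tolist())  # [0, 1, 2, 4]
```

The first column survives because its top count (4) equals half the rows rather than exceeding it, which matches the 4 = 4 case in the question.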