Suppose I have an dataframe that looks like this:
A B C D E F G H
1 0 1 0 1 1 1 1
0 1 1 1 1 0 0 0
1 0 1 0 0 0 0 0
1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1
I want to choose the maximum number of data points given the constraint that they all have 1's when the other selected columns have 1s. We can prune rows to get this result as well but the objective is to get as many data points as possible (where data point is defined as one item/cell in the dataframe)
The expected output is unclear, but you can use aggregation to identify the similar columns and count the total number of 1s:
out = (df.T
.assign(count=df.sum())
.reset_index()
.groupby(list(df.index))
.agg({'index': list, 'count': 'sum'})
)
Output:
index count
0 1 2 3 4
0 1 0 1 1 [B, D] 6
1 0 0 1 1 [F, G, H] 9
1 0 1 1 1 [A] 4
1 1 0 1 1 [E] 4
1 1 1 0 1 [C] 4
You can then get the columns giving the max count:
out.loc[out['count'].idxmax(), 'index']
Output: ['F', 'G', 'H']