Search code examples
pythonpandasdataframeoptimizationmax

How to choose a combination that will maximize the number of data points chosen given that columns must match in value for all rows


Suppose I have an dataframe that looks like this:

A B C D E F G H 
1 0 1 0 1 1 1 1
0 1 1 1 1 0 0 0
1 0 1 0 0 0 0 0
1 1 0 1 1 1 1 1
1 1 1 1 1 1 1 1

I want to choose the maximum number of data points given the constraint that they all have 1's when the other selected columns have 1s. We can prune rows to get this result as well but the objective is to get as many data points as possible (where data point is defined as one item/cell in the dataframe)


Solution

  • The expected output is unclear, but you can use aggregation to identify the similar columns and count the total number of 1s:

    out = (df.T
       .assign(count=df.sum())
       .reset_index()
       .groupby(list(df.index))
       .agg({'index': list, 'count': 'sum'})
    )
    

    Output:

                   index  count
    0 1 2 3 4                  
    0 1 0 1 1     [B, D]      6
    1 0 0 1 1  [F, G, H]      9
    1 0 1 1 1        [A]      4
    1 1 0 1 1        [E]      4
    1 1 1 0 1        [C]      4
    

    You can then get the columns giving the max count:

    out.loc[out['count'].idxmax(), 'index']
    

    Output: ['F', 'G', 'H']