Search code examples
pythondataframesample

Dataframe - Select the minimum set of rows to cover all possible values of each columns


I'm working on a use-case on which I need to retrieve a minimal sample of rows from a dataframe that contains at least one row for each unique value found in all columns.

A simplified example could be :

ID A B C
1 D G X
2 D G Y
3 E G Y
4 E H Y
5 F I Z

Here I would like to keep rows with ID 1, 4 and 5, in order to have at least one row with values D, E and F from A; G, H and I from B and X, Y and Z from C. I don't need to have all combinations, only all uniques values from every column:

ID A B C
1 D G X
4 E H Y
5 F I Z

What could be an efficient way to do this?

Thank you


Solution

  • Here is what I ended up going with:

    • Count the number of occurrences of every value in each column, and order those by count asc
    • Select a random row that contains the least occurring value of all, and add this value and all the others from the selected row into a dictionary
    • Looping through my ordered list of values, count the number of "new values" for each row that contains the current value of the list the loop is on. Pick the row with the highest number of "new values".
    • Repeat until all values have been picked

    Far from the optimal solution as I'm only minimizing the number of rows on each step, but it's pretty efficient and it gets the job done. Could have probably gone with a recursive algorithm to test all solutions but I don't have a big requirement on the final set of rows, except it needed all possible values.

    Thank you all for inputs, they helped me come up to this solution.