This question is not completely the same as the greedy set cover problem, but they share the same idea.
Given a Pandas dataframe df with one column df['s'], where each element is a set of keys into df2:
import numpy as np
import pandas as pd
>>> df = pd.DataFrame(np.array([set([1,3,5]), set([1,3,5,6]), set([2,3,4,12]), set([1,3,7]), set([1,15,11]), set([1,16]), set([16])]),columns=['s'])
>>> df
                    s
0      set([1, 3, 5])
1   set([1, 3, 5, 6])
2  set([12, 2, 3, 4])
3      set([1, 3, 7])
4    set([1, 11, 15])
5        set([1, 16])
6           set([16])
...
>>> df2 = pd.DataFrame(np.array([[1,2,3,3,3,6,4,8,9,10,11,12,13,14,15,16,5,7],[2.,1.,3.,2.,1.,2.,3.,1.,1.,1.,1.,1.,1.,1.,1.,16.,1.,1.]]).T,columns=['key', 'value'])
>>> df2
    key  value
0     1      2
1     2      1
2     3      3
3     3      2
4     3      1
5     6      2
6     4      3
7     8      1
8     9      1
9    10      1
10   11      1
11   12      1
12   13      1
13   14      1
14   15      1
15   16     16
16    5      1
17    7      1
...
The dataframe df2 above can contain duplicate keys; we take the last occurrence. For example, the value used for key "3" above is "1.0".
I want to find the top six rows of df['s'] that maximize the sum of the values of their corresponding keys, and to sort the rows of the new dataframe by their value contribution. What is the fastest way to do this?
For the given data set above, the first two rows of the result dataframe should be
df3:
set([1,16])
set([12,2,3,4])
...
The second row above is not set([16]), because "16" is already covered by set([1, 16]), so set([16]) would add zero value. The rows are sorted by the sum of the values contributed by the keys of each set.
Update:
To keep this problem simple, let's assume df2 contains only unique keys; the duplicate-key case can easily be handled with Andrew's trick.
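To make the intent concrete, here is a rough sketch of the greedy selection I have in mind. The helper name pick_greedy is just illustrative, and it assumes the df and df2 defined above:

def pick_greedy(sets_by_row, weights, n=6):
    # sets_by_row: {row label: set of keys}; weights: {key: value}
    # Each step picks the row whose not-yet-covered keys add the most value,
    # so the picked rows come out already sorted by marginal contribution.
    covered, picks = set(), []
    remaining = dict(sets_by_row)
    for _ in range(n):
        if not remaining:
            break
        gains = {i: sum(weights.get(k, 0) for k in s - covered)
                 for i, s in remaining.items()}
        best = max(gains, key=gains.get)
        picks.append((best, gains[best]))
        covered |= remaining.pop(best)
    return picks

weights = df2.drop_duplicates('key', keep='last').set_index('key')['value'].to_dict()
pick_greedy(df['s'].to_dict(), weights)
# first two picks: row 5 (set([1, 16]), gain 18.0), then row 2 (set([12, 2, 3, 4]), gain 6.0)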
Assuming you don't have too many keys, you could represent your list of sets as a sparse matrix, with a column for each key.
In [29]: df = pd.DataFrame([{1:1,3:1,5:1}, {1:1,3:1,5:1,6:1}, {2:1,3:1,4:1,12:1}, {1:1,3:1,7:1}, {1:1,15:1,11:1}, {9:1}, {16:1}]).fillna(0)
In [30]: df
Out[30]:
   1  2  3  4  5  6  7  9  11  12  15  16
0  1  0  1  0  1  0  0  0   0   0   0   0
1  1  0  1  0  1  1  0  0   0   0   0   0
2  0  1  1  1  0  0  0  0   0   1   0   0
3  1  0  1  0  0  0  1  0   0   0   0   0
4  1  0  0  0  0  0  0  0   1   0   1   0
5  0  0  0  0  0  0  0  1   0   0   0   0
6  0  0  0  0  0  0  0  0   0   0   0   1
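As a side note, if your sets already sit in a column like df['s'] in the question, you could build the same kind of indicator frame directly from it. A small sketch, where df_sets stands for the question's original frame of sets so it doesn't clash with the df above:

# one column per key, 1 where the key appears in the row's set, 0 otherwise
indicator = pd.DataFrame([{k: 1 for k in s} for s in df_sets['s']]).fillna(0)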
And then represent your weights as a Series, indexed by key (drop_duplicates with keep='last' keeps the last value for each duplicated key, matching the rule in the question):
In [37]: weights = df2.drop_duplicates('key', keep='last').set_index('key')['value']
Then weight and sum your sets (the multiplication aligns df's columns with the weights' index by key):
In [40]: totals = (df * weights).sum(axis=1)
In [41]: totals
Out[41]:
0 4
1 6
2 6
3 4
4 4
5 1
6 16
dtype: float64
And then just find the top 6 rows:
In [55]: top6 = totals.sort_values(ascending=False).head(6)
In [56]: top6
Out[56]:
6 16
2 6
1 6
4 4
3 4
0 4
dtype: float64
You can use these indices to index back into the sparse matrix and recover which sets these were:
In [58]: df.loc[top6.index]
Out[58]:
   1  2  3  4  5  6  7  9  11  12  15  16
6  0  0  0  0  0  0  0  0   0   0   0   1
2  0  1  1  1  0  0  0  0   0   1   0   0
1  1  0  1  0  1  1  0  0   0   0   0   0
4  1  0  0  0  0  0  0  0   1   0   1   0
3  1  0  1  0  0  0  1  0   0   0   0   0
0  1  0  1  0  1  0  0  0   0   0   0   0
You might not like this approach, but I would point out that DataFrames holding data structures like sets (rather than primitives) as elements aren't particularly pandas-ish, so some translation of the problem is advised.
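And if you want the chosen rows back as plain sets rather than indicator rows, one possible translation (just a sketch, using the df and top6 from above) is:

# map each selected indicator row back to the set of its nonzero column labels
top_sets = df.loc[top6.index].apply(lambda row: set(row.index[row > 0]), axis=1)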