Given this sample:
import pandas as pd

df = pd.DataFrame({'col1': ['id1', 'id2', 'id3'],
                   'col2': ['name1', 'foobar', 'name3'],
                   'col3': [{'am', 'e1', 'me', 'na'},
                            {'ar', 'ba', 'fo', 'ob', 'oo'},
                            {'am', 'e3', 'me', 'na'}]})
  col1    col2                  col3
0  id1   name1      {na, e1, me, am}
1  id2  foobar  {ar, fo, ba, oo, ob}
2  id3   name3      {na, e3, me, am}
The goal is to subset df, keeping every row whose col3 set has a Jaccard similarity with a given set (|A ∩ B| / |A ∪ B|) above a matching threshold.
My solution:
def subset_by_intersection_threshold(set_1, set_2, threshold):
    intersection = len(list(set_1.intersection(set_2)))
    union = (len(set_1) + len(set_2)) - intersection
    return float(intersection / union) > threshold
With this Jaccard function and pandas apply, I filter for all rows that meet the condition (a threshold of 0.4 in this example):
set_words = set(['na', 'me'])
df[df.col3.apply(lambda x: subset_by_intersection_threshold(set(x), set_words, 0.4))]
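With this sample and a 0.4 threshold, the filter should keep rows id1 and id3 (intersection of 2 over a union of 4 gives Jaccard 0.5 > 0.4) and drop the foobar row (no overlap, Jaccard 0). The output looks roughly like this (set print order may vary):

  col1   col2              col3
0  id1  name1  {na, e1, me, am}
2  id3  name3  {na, e3, me, am}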
Since this solution feels a bit brute-force to me, I'm opening this question to learn about more efficient alternatives in terms of execution time.
From slower to faster:
%timeit df.col3.apply(lambda x: original(set(x), set_words, 0.4))   # 74 ms per loop
%timeit df.col3.apply(lambda x: jpp(x, set_words, 0.4))             # 32.3 ms per loop
%timeit list(map(lambda x: jpp(x, set_words, 0.4), df['col3']))     # 13.9 ms per loop
%timeit [jpp(x, set_words, 0.4) for x in df['col3']]                # 12.2 ms per loop
You can improve performance by a factor of ~2x by avoiding unnecessary list creation and float / set conversion. For an extra boost, index via a list of Boolean values constructed with a list comprehension. As is often the case, pd.Series.apply may underperform a regular loop inside a list comprehension.
def original(set_1, set_2, threshold):
    intersection = len(list(set_1.intersection(set_2)))
    union = (len(set_1) + len(set_2)) - intersection
    return float(intersection / union) > threshold

def jpp(set_1, set_2, threshold):
    intersection = len(set_1 & set_2)
    union = (len(set_1) + len(set_2)) - intersection
    return (intersection / union) > threshold
set_words = {'na', 'me'}
df = pd.concat([df]*10000)
%timeit df.col3.apply(lambda x: original(set(x), set_words, 0.4)) # 74 ms per loop
%timeit df.col3.apply(lambda x: jpp(x, set_words, 0.4)) # 32.3 ms per loop
%timeit [jpp(x, set_words, 0.4) for x in df['col3']] # 23.4 ms per loop
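For completeness, here is a sketch of the Boolean-list indexing described above (my own illustration, not part of the benchmark): the list comprehension yields a list of True/False flags, which can be passed directly to df[...] to produce the subsetted frame.

# Build the Boolean mask with a list comprehension, then index the DataFrame with it.
mask = [jpp(x, set_words, 0.4) for x in df['col3']]
result = df[mask]   # keeps only rows whose Jaccard similarity exceeds 0.4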