python · pandas · performance · parallel-processing · execution-time

Performance of selecting rows where the condition is a percentage match with a set


Given this sample:

import pandas as pd

df = pd.DataFrame({'col1': ['id1', 'id2', 'id3'],
                   'col2': ['name1', 'foobar', 'name3'],
                   'col3': [{'am', 'e1', 'me', 'na'},
                            {'ar', 'ba', 'fo', 'ob', 'oo'},
                            {'am', 'e3', 'me', 'na'}]})

    col1    col2    col3
0   id1     name1   {na, e1, me, am}
1   id2     foobar  {ar, fo, ba, oo, ob}
2   id3     name3   {na, e3, me, am}

The goal is to subset df to the rows whose col3 set meets a matching threshold based on its intersection with a given set of words.

My solution:

def subset_by_intersection_threshold(set_1, set_2, threshold):
    intersection = len(list(set_1.intersection(set_2)))
    union = (len(set_1) + len(set_2)) - intersection
    return float(intersection / union)>threshold

Using this Jaccard function together with pandas apply, I filter all rows that meet the threshold (0.4 of matching in this example). The function compares the Jaccard index |set_1 ∩ set_2| / |set_1 ∪ set_2| against the threshold, computing the union size as |set_1| + |set_2| - |intersection|.

set_words = set(['na', 'me'])

df[df.col3.apply(lambda x: subset_by_intersection_threshold(set(x), set_words, 0.4))]
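
For the three-row sample above, rows 0 and 2 score 0.5 on the Jaccard index and row 1 scores 0, so the filter should return:

    col1    col2    col3
0   id1     name1   {na, e1, me, am}
2   id3     name3   {na, e3, me, am}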

Since this solution feels a bit brute-force, I'm opening this question to learn about more efficient alternatives in terms of execution time.

Adding the benchmark results measured on my personal laptop:

From slower to faster:

%timeit df.col3.apply(lambda x: original(set(x), set_words, 0.4))  # 74 ms per loop
%timeit df.col3.apply(lambda x: jpp(x, set_words, 0.4))            # 32.3 ms per loop
%timeit list(map(lambda x: jpp(x, set_words, 0.4), df['col3']))    # 13.9 ms per loop
%timeit [jpp(x, set_words, 0.4) for x in df['col3']]               # 12.2 ms per loop

Solution

  • You can improve performance by a factor of ~2x by avoiding unnecessary list creation and float / set conversion. For an extra boost, index via a list of Boolean values constructed with a list comprehension (see the usage sketch after the benchmarks below). As is often the case, pd.Series.apply may underperform a regular loop inside a list comprehension.

    def original(set_1, set_2, threshold):
        intersection = len(list(set_1.intersection(set_2)))
        union = (len(set_1) + len(set_2)) - intersection
        return float(intersection / union)>threshold
    
    def jpp(set_1, set_2, threshold):
        intersection = len(set_1 & set_2)
        union = (len(set_1) + len(set_2)) - intersection
        return (intersection / union) > threshold
    
    set_words = {'na', 'me'}
    
    df = pd.concat([df]*10000)
    
    %timeit df.col3.apply(lambda x: original(set(x), set_words, 0.4))  # 74 ms per loop
    %timeit df.col3.apply(lambda x: jpp(x, set_words, 0.4))            # 32.3 ms per loop
    %timeit [jpp(x, set_words, 0.4) for x in df['col3']]               # 23.4 ms per loop
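
To actually produce the subset rather than just time the mask construction, here is a minimal usage sketch reusing jpp and set_words from above (the names mask and res are just illustrative):

    # build the Boolean mask once with a list comprehension, then index df with it
    mask = [jpp(x, set_words, 0.4) for x in df['col3']]
    res = df[mask]  # rows whose Jaccard score with set_words exceeds the 0.4 threshold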