Tags: python, pandas, performance, optimization, series

How can I union two pd.Series efficiently?


I have two sorted pd.Series, like:

A = [1, 3, 5, 7]
B = [3, 4, 5, 8, 10]

I'd like to union them to get a new list:

C = [1, 3, 4, 5, 7, 8, 10]

The following code solves it:

import pandas as pd

A = pd.Series([1, 3, 5, 7], name='col')
B = pd.Series([3, 4, 5, 8, 10], name='col')
pd.concat([A, B], axis=0).drop_duplicates().sort_values(ascending=True)

Alternatively, I can do:

sorted(set(A).union(set(B)))

My real problem involves much larger arrays: each of A1, A2, ..., A50 contains 100k+ strings, more than 99% of which overlap. The union operation will run 50 times.

Which solution is more time-efficient? Is there an even more efficient way to union them without resorting to Cython or numba?
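
For concreteness, the workload looks roughly like this (the data below is synthetic, only mimicking the sizes and overlap described above):

import pandas as pd

# Synthetic stand-in for A1 ... A50: 50 Series of 100k+ strings,
# where consecutive Series overlap in more than 99% of elements.
series_list = [
    pd.Series(map(str, range(i * 100, i * 100 + 100_000)))
    for i in range(50)
]

# The union that has to run 50 times; a running set avoids
# re-sorting after every step.
result = set(series_list[0].tolist())
for s in series_list[1:]:
    result |= set(s.tolist())
C = sorted(result)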


Solution

  • The set operation is quite efficient, and faster still if you first convert the Series to lists:

    sorted(set(A.tolist()).union(B.tolist()))
    

    Let's do some timings and add numpy.union1d to the comparison:

    import numpy as np
    import pandas as pd

    # 100k+ strings; more than 99% of elements overlap
    A = pd.Series(map(str, range(100_000))).sample(frac=1)
    B = pd.Series(map(str, range(1000, 101_000))).sample(frac=1)
    
    %%timeit
    sorted(set(A).union(B))
    # 107 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %%timeit
    sorted(set(A.tolist()).union(B.tolist()))
    # 71 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %%timeit
    np.union1d(A, B).tolist()
    # 189 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    %%timeit
    pd.concat([A,B], axis=0).drop_duplicates().sort_values(ascending=True).tolist()
    # 152 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Timings if the inputs are already sorted:

    # set union
    64.5 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    # set union from lists
    51.9 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    # numpy.union1d
    134 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    Alternative: sortednp

    This alternative was mentioned in a comment; it is indeed very efficient for numeric data, but unfortunately it won't work with strings.

    import sortednp as snp
    
    A = pd.Series(range(100_000))
    B = pd.Series(range(1000, 101_000))
    
    out = snp.merge(A.to_numpy(), B.to_numpy(), duplicates=snp.DROP).tolist()
    

    Timings with numeric data:

    %timeit snp.merge(A.to_numpy(), B.to_numpy(), duplicates=snp.DROP)
    # 552 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit snp.merge(A.to_numpy(), B.to_numpy(), duplicates=snp.DROP).tolist()
    # 1.7 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    %timeit sorted(set(A).union(B))
    # 17 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit sorted(set(A.tolist()).union(B.tolist()))
    # 13.6 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %timeit np.union1d(A, B).tolist()
    # 9.8 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit pd.concat([A,B], axis=0).drop_duplicates().sort_values(ascending=True).tolist()
    # 4.99 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
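
    Back to the original 50-array scenario: with numeric data, the pairwise snp.merge can simply be folded across all arrays. A minimal sketch (the array list here is synthetic, standing in for the real inputs):

    from functools import reduce

    import numpy as np
    import sortednp as snp

    # Synthetic stand-in for 50 sorted numeric arrays with heavy overlap
    arrays = [np.arange(i * 1000, i * 1000 + 100_000) for i in range(50)]

    # Fold pairwise merges, dropping duplicates at each step; every
    # intermediate result stays sorted, so no final sort is needed.
    union = reduce(lambda a, b: snp.merge(a, b, duplicates=snp.DROP), arrays)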