I have two sorted pd.Series, like
A = [1, 3, 5, 7]
B = [3, 4, 5, 8, 10]
I'd like to union them to get a new sorted list
C = [1, 3, 4, 5, 7, 8, 10]
The following code solves it:
import pandas as pd

A = pd.Series([1, 3, 5, 7], name='col')
B = pd.Series([3, 4, 5, 8, 10], name='col')
pd.concat([A, B], axis=0).drop_duplicates().sort_values(ascending=True)
Alternatively, I can do
sorted(set(A).union(set(B)))
My real problem involves very large arrays: each of A1, A2, ..., A50 has 100k+ strings, and more than 99% of the elements overlap. The union operation will run 50 times.
Which solution is more time-efficient? Is there an even more efficient way to union them without using Cython or numba?
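For concreteness, the repeated union I have in mind looks roughly like this (series_list is a hypothetical stand-in for A1 through A50):
import pandas as pd

# hypothetical stand-in for A1, A2, ..., A50: large, mostly-overlapping string Series
series_list = [pd.Series(map(str, range(i, 100_000 + i))) for i in range(50)]

result = set()
for s in series_list:          # the union runs 50 times
    result = result.union(s)
C = sorted(result)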
The set operation should be quite efficient, and faster still if you first convert the Series to lists:
sorted(set(A.tolist()).union(B.tolist()))
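Note that set.union accepts any number of iterables, so the 50-series case can be done in a single call rather than 50 pairwise unions; a minimal sketch, again using a hypothetical series_list holding all 50 Series:
# one-shot union of all 50 Series
C = sorted(set().union(*(s.tolist() for s in series_list)))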
Let's do some timings, adding numpy.union1d to the comparison:
import numpy as np
import pandas as pd

# 100k+ strings, more than 99% of elements overlapping
A = pd.Series(map(str, range(100_000))).sample(frac=1)
B = pd.Series(map(str, range(1000, 101_000))).sample(frac=1)
%%timeit
sorted(set(A).union(B))
# 107 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
sorted(set(A.tolist()).union(B.tolist()))
# 71 ms ± 4.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
np.union1d(A, B).tolist()
# 189 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
pd.concat([A, B], axis=0).drop_duplicates().sort_values(ascending=True).tolist()
# 152 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Timings if the inputs are already sorted:
# set union
64.5 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# set union from lists
51.9 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# numpy.union1d
134 ms ± 2.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
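(A plausible setup for these sorted-input runs, pre-sorting the shuffled Series from above:)
# pre-sort the benchmark inputs; the strings sort lexicographically
A = A.sort_values(ignore_index=True)
B = B.sort_values(ignore_index=True)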
sortednp
This alternative was mentioned in a comment. It is indeed very efficient if you have numeric data, but unfortunately won't work with strings.
import sortednp as snp
A = pd.Series(range(100_000))
B = pd.Series(range(1000, 101_000))
out = snp.merge(A.to_numpy(), B.to_numpy(), duplicates=snp.DROP).tolist()
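To union all 50 numeric arrays, the pairwise merge can be chained with functools.reduce, using only the merge call shown above (arrays is a hypothetical stand-in for the 50 inputs):
from functools import reduce

import pandas as pd
import sortednp as snp

# hypothetical stand-in for 50 sorted numeric arrays
arrays = [pd.Series(range(i, 100_000 + i)).to_numpy() for i in range(50)]

# each merge keeps the result sorted and duplicate-free, so chaining is valid
out = reduce(lambda a, b: snp.merge(a, b, duplicates=snp.DROP), arrays)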
Timings with numeric data:
%timeit snp.merge(A.to_numpy(), B.to_numpy(), duplicates=snp.DROP)
# 552 µs ± 4.14 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit snp.merge(A.to_numpy(), B.to_numpy(), duplicates=snp.DROP).tolist()
# 1.7 ms ± 149 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit sorted(set(A).union(B))
# 17 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sorted(set(A.tolist()).union(B.tolist()))
# 13.6 ms ± 2.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.union1d(A, B).tolist()
# 9.8 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.concat([A, B], axis=0).drop_duplicates().sort_values(ascending=True).tolist()
# 4.99 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)