I am trying to figure out the most efficient way to search a data frame in Pandas with a list (dataframe) of other values without using brute force methods. Is there a way to vectorize it? I know I can for loop each element of the list (or dataframe) and extract the data using the loc method, but was hoping for something faster. I have a data frame with 1 million rows and I need to search within it to extract the index of 600,000 rows.
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'WholeList': np.round(1000000*(np.random.rand(1000000)),0)})
df2 = pd.DataFrame({'ThingsToFind': np.arange(50000)+50000})
df.loc[1:10,:]
#Edited, now that I think about it, the 'arange' method would have been better to populate the arrays.
I want the most efficient way to get the index of df2 in df, where it exists in df.
Thanks!
Here's an approach with np.searchsorted
as it seems the second dataframe has its elements sorted and unique -
def find_index(a,b, invalid_specifier = -1):
idx = np.searchsorted(b,a)
idx[idx==b.size] = 0
idx[b[idx] != a] = invalid_specifier
return idx
def process_dfs(df, df2):
a = df.WholeList.values.ravel()
b = df2.ThingsToFind.values.ravel()
return find_index(a,b, invalid_specifier=-1)
Sample run on arrays -
In [200]: a
Out[200]: array([ 3, 5, 8, 4, 3, 2, 5, 2, 12, 6, 3, 7])
In [201]: b
Out[201]: array([2, 3, 5, 6, 7, 8, 9])
In [202]: find_index(a,b, invalid_specifier=-1)
Out[202]: array([ 1, 2, 5, -1, 1, 0, 2, 0, -1, 3, 1, 4])
Sample run on dataframes -
In [188]: df
Out[188]:
WholeList
0 3
1 5
2 8
3 4
4 3
5 2
6 5
7 2
8 12
9 6
10 3
11 7
In [189]: df2
Out[189]:
ThingsToFind
0 2
1 3
2 5
3 6
4 7
5 8
6 9
In [190]: process_dfs(df, df2)
Out[190]: array([ 1, 2, 5, -1, 1, 0, 2, 0, -1, 3, 1, 4])