python performance pandas numpy processing-efficiency

Pandas - Searching Column of Data Frame from List Efficiently

I am trying to figure out the most efficient way to search a data frame in Pandas with a list (dataframe) of other values without using brute force methods. Is there a way to vectorize it? I know I can for loop each element of the list (or dataframe) and extract the data using the loc method, but was hoping for something faster. I have a data frame with 1 million rows and I need to search within it to extract the index of 600,000 rows.

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'WholeList': np.round(1000000*(np.random.rand(1000000)),0)})
df2 = pd.DataFrame({'ThingsToFind': np.arange(50000)+50000})
df.loc[1:10,:]
#Edited, now that I think about it, the 'arange' method would have been better to populate the arrays.

I want the most efficient way to get the index of df2 in df, where it exists in df.

Thanks!

Solution

Here's an approach with np.searchsorted as it seems the second dataframe has its elements sorted and unique -

def find_index(a,b, invalid_specifier = -1):
    idx = np.searchsorted(b,a)
    idx[idx==b.size] = 0
    idx[b[idx] != a] = invalid_specifier
    return idx

def process_dfs(df, df2):
    a = df.WholeList.values.ravel()
    b = df2.ThingsToFind.values.ravel()
    return find_index(a,b, invalid_specifier=-1)

Sample run on arrays -

In [200]: a
Out[200]: array([ 3,  5,  8,  4,  3,  2,  5,  2, 12,  6,  3,  7])

In [201]: b
Out[201]: array([2, 3, 5, 6, 7, 8, 9])

In [202]: find_index(a,b, invalid_specifier=-1)
Out[202]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])

Sample run on dataframes -

In [188]: df
Out[188]: 
    WholeList
0           3
1           5
2           8
3           4
4           3
5           2
6           5
7           2
8          12
9           6
10          3
11          7

In [189]: df2
Out[189]: 
   ThingsToFind
0             2
1             3
2             5
3             6
4             7
5             8
6             9

In [190]: process_dfs(df, df2)
Out[190]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])