Search code examples
pythonperformancepandasnumpyprocessing-efficiency

Pandas - Searching Column of Data Frame from List Efficiently


I am trying to figure out the most efficient way to search a data frame in Pandas with a list (dataframe) of other values without using brute force methods. Is there a way to vectorize it? I know I can for loop each element of the list (or dataframe) and extract the data using the loc method, but was hoping for something faster. I have a data frame with 1 million rows and I need to search within it to extract the index of 600,000 rows.

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({'WholeList': np.round(1000000*(np.random.rand(1000000)),0)})
df2 = pd.DataFrame({'ThingsToFind': np.arange(50000)+50000})
df.loc[1:10,:]
#Edited, now that I think about it, the 'arange' method would have been better to populate the arrays.

I want the most efficient way to get the index of df2 in df, where it exists in df.

Thanks!


Solution

  • Here's an approach with np.searchsorted as it seems the second dataframe has its elements sorted and unique -

    def find_index(a,b, invalid_specifier = -1):
        idx = np.searchsorted(b,a)
        idx[idx==b.size] = 0
        idx[b[idx] != a] = invalid_specifier
        return idx
    
    def process_dfs(df, df2):
        a = df.WholeList.values.ravel()
        b = df2.ThingsToFind.values.ravel()
        return find_index(a,b, invalid_specifier=-1)
    

    Sample run on arrays -

    In [200]: a
    Out[200]: array([ 3,  5,  8,  4,  3,  2,  5,  2, 12,  6,  3,  7])
    
    In [201]: b
    Out[201]: array([2, 3, 5, 6, 7, 8, 9])
    
    In [202]: find_index(a,b, invalid_specifier=-1)
    Out[202]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])
    

    Sample run on dataframes -

    In [188]: df
    Out[188]: 
        WholeList
    0           3
    1           5
    2           8
    3           4
    4           3
    5           2
    6           5
    7           2
    8          12
    9           6
    10          3
    11          7
    
    In [189]: df2
    Out[189]: 
       ThingsToFind
    0             2
    1             3
    2             5
    3             6
    4             7
    5             8
    6             9
    
    In [190]: process_dfs(df, df2)
    Out[190]: array([ 1,  2,  5, -1,  1,  0,  2,  0, -1,  3,  1,  4])