python pandas dataframe numpy scipy

Python: find the first value below a threshold in column 2 of a Pandas dataframe and return the column 1 value from the same row, using NumPy


I have a dataframe as below:

0.1   0.65
0.2   0.664
0.3   0.606
0.4   0.587
0.5   0.602
0.6   0.59
0.7   0.53

I have to find the first occurrence of a value below 0.6 in column 2 and return the value of column 1 on the same row. In this example the returned value would be 0.4.

How could I do this using NumPy or SciPy?

The code is:

import pandas as pd

df = pd.DataFrame([*zip([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7], [0.65, 0.664, 0.606, 0.587, 0.602, 0.59, 0.53])])

threshold = 0.6
var = df[df[1] < threshold].head(1)[0]
res = var.iloc[0]
    

Solution

  • You can use a boolean mask and df.head() to get the first row whose column 1 value is below the threshold.

    df[df[1] < threshold].head(1)[0]
    
    3    0.4
    Name: 0, dtype: float64
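
The expression above returns a one-row Series, not a scalar, and the question's follow-up `.iloc[0]` raises an IndexError if no row is below the threshold. A minimal sketch of a scalar-returning variant, assuming the same two-column frame; the helper name `first_below_pandas` and the choice to return `None` when nothing matches are illustrative assumptions, not part of the original answer:

```python
import pandas as pd

# Same data as in the question; columns are labelled 0 and 1 by default.
df = pd.DataFrame(list(zip([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                           [0.65, 0.664, 0.606, 0.587, 0.602, 0.59, 0.53])))


def first_below_pandas(frame, threshold):
    """Return the column-0 value of the first row whose column-1 value
    is below threshold, or None if no row qualifies (hypothetical helper)."""
    matches = frame.loc[frame[1] < threshold, 0]
    return None if matches.empty else matches.iloc[0]


result = first_below_pandas(df, 0.6)    # first row with column 1 < 0.6
no_match = first_below_pandas(df, 0.1)  # no value in column 1 is below 0.1
```

Returning `None` is only one option; raising an exception or returning NaN may suit other callers better.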
    

    Update

    To use NumPy, convert the DataFrame to a NumPy array and use np.where.

    import numpy as np

    array = df.values
    
    array[np.where(array[:,1] < 0.6)][0,0]
    0.4
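
The np.where expression above raises an IndexError when no value is below the threshold, because the filtered array is empty. A minimal sketch of a guarded version, assuming the two columns are available as plain NumPy arrays; the helper name `first_below_numpy` and the `None` fallback are illustrative assumptions:

```python
import numpy as np

# The two columns from the question as separate arrays.
col0 = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
col1 = np.array([0.65, 0.664, 0.606, 0.587, 0.602, 0.59, 0.53])


def first_below_numpy(keys, values, threshold):
    """Return keys[i] for the smallest i with values[i] < threshold,
    or None if there is no such i (hypothetical helper)."""
    hits = np.flatnonzero(values < threshold)  # indices where the mask is True
    if hits.size == 0:
        return None
    return keys[hits[0]]


result = first_below_numpy(col0, col1, 0.6)
no_match = first_below_numpy(col0, col1, 0.1)
```

np.flatnonzero still scans the whole column, as np.where does; it does not stop at the first match.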
    

    To compare performance, time the two approaches on the example dataframe.

    # Pandas style
    def function1(df):
        return df[df[1] < threshold].head(1)[0]
    
    # Numpy style
    def function2(df):
        array = df.values
    
        return array[np.where(array[:,1] < 0.6)][0,0]
    
    %timeit function1(df)
    322 µs ± 6.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    %timeit function2(df)
    11.8 µs ± 209 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
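
%timeit only works inside IPython or Jupyter. A rough sketch of the same comparison with the standard-library timeit module, assuming the data from the question; the function names and the repetition count are arbitrary choices, and the numbers will differ by machine and library version, so none are quoted here:

```python
import timeit

import numpy as np
import pandas as pd

threshold = 0.6
df = pd.DataFrame(list(zip([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
                           [0.65, 0.664, 0.606, 0.587, 0.602, 0.59, 0.53])))


def pandas_style(frame):
    # Boolean mask, first matching row, then the scalar from column 0.
    return frame[frame[1] < threshold].head(1)[0].iloc[0]


def numpy_style(frame):
    # Convert once to a 2-D array, then index with the row positions of the matches.
    array = frame.to_numpy()
    return array[np.where(array[:, 1] < threshold)][0, 0]


# Check that both approaches agree before timing them.
assert pandas_style(df) == numpy_style(df)

runs = 1000  # arbitrary repetition count
pandas_seconds = timeit.timeit(lambda: pandas_style(df), number=runs)
numpy_seconds = timeit.timeit(lambda: numpy_style(df), number=runs)

print(f"pandas: {pandas_seconds / runs * 1e6:.1f} µs per call")
print(f"numpy:  {numpy_seconds / runs * 1e6:.1f} µs per call")
```

On a seven-row frame both timings mostly measure fixed call overhead, so it is worth repeating the measurement on data of realistic size before choosing one approach.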