Search code examples
pythonnumpypandasdata-analysis

Finding overlapping segments in Pandas


I have two pandas DataFrames A and B, with columns ['start', 'end', 'value'] but not the same number of rows. I'd like to set the values for each row in A as follows:

A.iloc(i) = B['value'][B['start'] < A[i,'start'] & B['end'] > A[i,'end']]

There is a possibility of multiple rows of B satisfy this condition for each i, in that case max or sum of corresponding rows would be the result. In case if none satisfies the value of A.iloc[i] should not be updated or set to a default value of 0 (either way would be fine)

I'm interested to find the most efficient way of doing this.


Solution

  • import numpy as np
    np.random.seed(1)
    lenB = 10
    lenA = 20
    B_start = np.random.rand(lenB)
    B_end   = B_start + np.random.rand(lenB)
    B_value = np.random.randint(100, 200, lenB)
    
    A_start = np.random.rand(lenA)
    A_end   = A_start + np.random.rand(lenA)
    
    #if you use dataframe
    #B_start = B["start"].values
    #B_end = ...
    
    mask = (A_start[:, None ] > B_start) & (A_end[:, None] < B_end)
    r, c = np.where(mask)
    result = pd.Series(B_value[c]).groupby(r).max()
    print result