Search code examples
pandasdataframeselectrows

How to mark or select rows in one dataframe, where value lies between any of the ranges in another dataframe featuring additional identifier


I am a bloody beginner and working with pandas and numpy for scientific dataframes.

I have 2 large dataframes, one with coordinates (geneX - position) and the other with ranges of coordinates (genes - start - end). I would like to select (or mark) all rows in the first dataset, were the coordinate falls into any of the ranges in the second dataframe.

For example:

import pandas as pd
import numpy as np

test = {'A': ['gene1','gene1','gene2','gene3','gene3','gene3','gene4'],
        'B': [1,11,21,31,41,51,61],
        'C': [10,20,30,40,50,60,70],
        'D': [4,64,25,7,36,56,7]}
  

df1 = pd.DataFrame(test, columns = ['A', 'D'])
df2 = pd.DataFrame(test, columns = ['A', 'B', 'C'])

This gives two dataframes looking like this:

df1:
    A       D   
0   gene1   4     <-- I want this row
1   gene1   64  
2   gene2   25    <-- and this row
3   gene3   7   
4   gene3   36    <-- and this row
5   gene3   56    <-- and this row
6   gene4   7 

df2:
    A       B   C
0   gene1   1   10 
1   gene1   11  20
2   gene2   21  30
3   gene3   31  40
4   gene3   41  50
5   gene3   51  60
6   gene4   61  70

I managed to come this far:

for i,j in zip(df2["B"],df2["C"]):
    x=df1[(df1["D"] >=i ) & (df1["D"] <= j)]
    df_final.append(x)
    

df_final=pd.concat(df_final,axis=0).reset_index(drop=True)
df_final=df_final.drop_duplicates()
df_final

But this only checks if the value in df1'D' is in any of the ranges in df2 but I fail to incorporate the "gene" identifier. Basically, I need it for each row in df1 to loop through df2 and first check if the gene matches, and if thats the case, check if the coordinate is in the data range.

Can anyone help me to figure this out?

Additional question: Can I make it, so it leaves the df1 intakt and just makes a new column with a "true" behind the rows that match the conditions directly? If not I would create a new df from the selected rows, add a column with the "true" label and then merge it back to the first one.

Thank you for your help. I really appreciate it !


Solution

  • Let us do in steps

    • Reset the index of df1
    • Merge df1 with df2 on column A (basically merge rows with same genes)
    • Query the merged dataframe to filter the rows where column D falls between B and C
    • Flag the rows by testing the membership of index of df1 in the index column of filtered rows
    m = df1.reset_index().merge(df2, on='A').query('B <= D <= C')
    df1['flag'] = df1.index.isin(m['index'])
    

    print(df1)
    
           A   D   flag
    0  gene1   4   True
    1  gene1  64  False
    2  gene2  25   True
    3  gene3   7  False
    4  gene3  36   True
    5  gene3  56   True
    6  gene4   7  False