I am a bloody beginner and working with pandas and numpy for scientific dataframes.
I have 2 large dataframes, one with coordinates (geneX - position) and the other with ranges of coordinates (genes - start - end). I would like to select (or mark) all rows in the first dataset, were the coordinate falls into any of the ranges in the second dataframe.
For example:
import pandas as pd
import numpy as np
test = {'A': ['gene1','gene1','gene2','gene3','gene3','gene3','gene4'],
'B': [1,11,21,31,41,51,61],
'C': [10,20,30,40,50,60,70],
'D': [4,64,25,7,36,56,7]}
df1 = pd.DataFrame(test, columns = ['A', 'D'])
df2 = pd.DataFrame(test, columns = ['A', 'B', 'C'])
This gives two dataframes looking like this:
df1:
A D
0 gene1 4 <-- I want this row
1 gene1 64
2 gene2 25 <-- and this row
3 gene3 7
4 gene3 36 <-- and this row
5 gene3 56 <-- and this row
6 gene4 7
df2:
A B C
0 gene1 1 10
1 gene1 11 20
2 gene2 21 30
3 gene3 31 40
4 gene3 41 50
5 gene3 51 60
6 gene4 61 70
I managed to come this far:
for i,j in zip(df2["B"],df2["C"]):
x=df1[(df1["D"] >=i ) & (df1["D"] <= j)]
df_final.append(x)
df_final=pd.concat(df_final,axis=0).reset_index(drop=True)
df_final=df_final.drop_duplicates()
df_final
But this only checks if the value in df1'D' is in any of the ranges in df2 but I fail to incorporate the "gene" identifier. Basically, I need it for each row in df1 to loop through df2 and first check if the gene matches, and if thats the case, check if the coordinate is in the data range.
Can anyone help me to figure this out?
Additional question: Can I make it, so it leaves the df1 intakt and just makes a new column with a "true" behind the rows that match the conditions directly? If not I would create a new df from the selected rows, add a column with the "true" label and then merge it back to the first one.
Thank you for your help. I really appreciate it !
Let us do in steps
df1
df1
with df2
on column A
(basically merge rows with same genes)D
falls between B
and C
index
of df1
in the index
column of filtered rowsm = df1.reset_index().merge(df2, on='A').query('B <= D <= C')
df1['flag'] = df1.index.isin(m['index'])
print(df1)
A D flag
0 gene1 4 True
1 gene1 64 False
2 gene2 25 True
3 gene3 7 False
4 gene3 36 True
5 gene3 56 True
6 gene4 7 False