I have a dataframe with interacting chromosomal pairs denoted by chromosome (chr), and position (pos) as follows:
>>>import pandas as pd
>>>df1
chr1 pos1 chr2 pos2
chr1 54278 chr13 68798
chr1 32145 chr7 1248798
...
[162689366 rows x 4 columns]
In the real dataset, these are sorted by chr1, then pos1, chr2, pos2.
I have another dataset with the interacting pairs I wish to look at in the following format:
>>>df2
chr start stop comment
chr1 54275 55080 cluster-1
chr1 515523 515634 cluster-2
...
chr13 68760 70760
...
[69 rows x 4 columns]
I wish to subset df1 to include rows if and only if both interacting pairs (chr1-pos1 & chr2-pos2) are found within the range of start and stop values in df2.
In this example, the final dataframe would look something like this:
>>>df3
chr1 pos1 chr2 pos2
chr1 54278 chr13 68798
...
I have been trying to do this step wise (for the first chr-pos pair, then the second) using the .between function in pandas without any success. I've tried in both python2.7 and python3.6.
>>>df3 = df1[(df1['chr1'].isin(df2.chr)) & df1['pos1'].between(df1.pos1(df2.start),df1.pos1(df2.stop))]
This seems to work for the .isin but I get an error for the .between function. I think because the dataframes are not the same length, but I can't be sure.
>>>df1['pos1'].between(df2.start,df2.stop)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 2412, in between
lmask = self >= left
File "/usr/lib/python2.7/dist-packages/pandas/core/ops.py", line 699, in wrapper
raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare
Any help is greatly appreciated!
Someone might have a more elegant solution, but in my head, I would join df2
to df1
twice, so that you get everything in one dataset, and comparisons are easy.
df2
is basically a lookup table, and df2.chr
should be matched to df1.chr1
and to df1.chr2
separately.
df_all = df1.merge(df2,
how='inner',
left_on='chr1',
right_on='chr') \
.merge(df2,
how='inner',
left_on='chr2',
right_on='chr',
suffixes=('_r1', '_r2'))
Note the suffixes. So pos1
will be tested to be in the start_r1
-stop_r1
range, and pos2
will be tested to be in the start_r2
-stop_r2
range.
df3 = df_all[(df_all['pos1'] \
.between(df_all['start_r1'], df_all['stop_r1'])) &
(df_all['pos2'] \
.between(df_all['start_r2'], df_all['stop_r2']))]
# Back to four original columns again
df3 = df3[['chr1', 'pos1', 'chr2', 'pos2']]