Search code examples
pythonpandasnumpygenome

Subset dataframe with another dataframe of a different length


I have a dataframe with interacting chromosomal pairs denoted by chromosome (chr), and position (pos) as follows:

>>>import pandas as pd

>>>df1
chr1     pos1     chr2     pos2
chr1    54278    chr13    68798
chr1    32145     chr7  1248798
... 
[162689366 rows x 4 columns]

In the real dataset, these are sorted by chr1, then pos1, chr2, pos2.

I have another dataset with the interacting pairs I wish to look at in the following format:

>>>df2
chr     start     stop     comment
chr1    54275    55080   cluster-1
chr1   515523   515634   cluster-2
...
chr13   68760    70760
...
[69 rows x 4 columns]

I wish to subset df1 to include rows if and only if both interacting pairs (chr1-pos1 & chr2-pos2) are found within the range of start and stop values in df2.

In this example, the final dataframe would look something like this:

>>>df3
chr1    pos1      chr2     pos2
chr1    54278    chr13    68798
...

I have been trying to do this step wise (for the first chr-pos pair, then the second) using the .between function in pandas without any success. I've tried in both python2.7 and python3.6.

>>>df3 = df1[(df1['chr1'].isin(df2.chr)) & df1['pos1'].between(df1.pos1(df2.start),df1.pos1(df2.stop))]

This seems to work for the .isin but I get an error for the .between function. I think because the dataframes are not the same length, but I can't be sure.

>>>df1['pos1'].between(df2.start,df2.stop)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/pandas/core/series.py", line 2412, in between
    lmask = self >= left
  File "/usr/lib/python2.7/dist-packages/pandas/core/ops.py", line 699, in wrapper
    raise ValueError('Series lengths must match to compare')
ValueError: Series lengths must match to compare

Any help is greatly appreciated!


Solution

  • Someone might have a more elegant solution, but in my head, I would join df2 to df1 twice, so that you get everything in one dataset, and comparisons are easy.

    df2 is basically a lookup table, and df2.chr should be matched to df1.chr1 and to df1.chr2 separately.

    df_all = df1.merge(df2,
                       how='inner',
                       left_on='chr1',
                       right_on='chr') \
                .merge(df2,
                       how='inner',
                       left_on='chr2',
                       right_on='chr',
                       suffixes=('_r1', '_r2'))
    

    Note the suffixes. So pos1 will be tested to be in the start_r1-stop_r1 range, and pos2 will be tested to be in the start_r2-stop_r2 range.

    df3 = df_all[(df_all['pos1'] \
                      .between(df_all['start_r1'], df_all['stop_r1'])) &
                 (df_all['pos2'] \
                      .between(df_all['start_r2'], df_all['stop_r2']))]
    
    # Back to four original columns again
    df3 = df3[['chr1', 'pos1', 'chr2', 'pos2']]