Search code examples
pythonpandasdataframebioinformatics

Find non-overlapping intervals within DNA coordinates


I am trying to find the non-overlapping intervals for the start/end DNA coordinates (on the same chromosome). I am having a hard time developing a function that takes into account two rows on the same exon. The non-overlapping interval must be unique (not overlap with any other intervals).

For ex, on the first row below, the non-overlapping interval would be 1-49, 61-100. But, if you look at the second row, the non-overlapping interval would be 1-69, 81-100. I want non-overlapping, non-overlapping intervals, so the true interval output I want is 1-49, 61-69, 81-100. Ideally, I would like these intervals to bee separated into their own columns (non-overlap_start, non-overlap_end).

*Please note: I have unique intervals per chromosome.

starting DF

chrom        exon_start    exon_end   start1      end1   
1                  1       100        50           60      
1                  1       100        70           80      
1                  150     155        155          160
2                  5       50         25           100     

final DF

chrom   exon_start    exon_end   start1   end1   non_overlap 
1           1             100    50      60      [1-49, 61-69, 81-100]        
1           1             100    70      80      [1-49, 61-69, 81-100] 
1          150             155  155      160      [150-154]
2           5             50     25      100     [5-24] 


Solution

  • Is this something you want?

    def find_non_overlap(part):
        minval = part.ex_start.min()
        maxval = part.ex_end.max()
        t = np.array([np.concatenate((
                # add additional True from each side
                np.ones(1 + row.ex_start - minval),
                np.zeros(row.overlap_start - row.ex_start), 
                np.ones(row.overlap_end - row.overlap_start + 1), 
                np.zeros(row.ex_end - row.overlap_end),
                np.ones(1 + maxval - row.ex_end), 
            )) for _, row in part.iterrows()])
        changes = minval + np.flatnonzero(np.diff(np.logical_or.reduce(t)))
        return list(zip(changes[::2], changes[1::2] - 1))
        # return pd.Series(
        #     {'non_overlap_starts': changes[::2], 
        #     'non_overlap_ends': changes[1::2] - 1}
        # )
    
    def calc_intervals(df):
        df = df.copy()
        df['overlap_start'] = df[['ex_start', 'start']].max(axis=1) 
        df['overlap_start'] = df[['overlap_start']].assign(
            t=df.ex_end + 1).min(axis=1)
        df['overlap_end'] = df[['ex_end', 'end']].min(axis=1) 
        df['overlap_end'] = df[['overlap_end']].assign(
            t=df.ex_start - 1).max(axis=1)
        return df.groupby(
            ['chrom', 'ex_start', 'ex_end']
        )[df.columns].apply(
            find_non_overlap, include_groups=False)
    
    df = pd.DataFrame([
        [0, 0, 100, 50, 60],
        [0, 0, 100, 20, 30],
        [0, 0, 100, 25, 40],
        [0, 0, 100, 90, 100],
        [0, 0, 100, 0, 5],
        [1, 10, 50, 5, 15],
        [2, 10, 50, 1, 5],
        [3, 10, 50, 30, 70],
        [4, 10, 50, 5, 55],
        [5, 10, 50, 20, 30],
        [6, 1, 100, 50, 60],
        [6, 1, 100, 70, 80],
        [6, 150, 155, 155, 160],
        [7, 5, 50, 25, 100],
        [8, 0, 5, 0, 1],
        [8, 0, 5, 3, 5],
        [9, 0, 5, 0, 4]
    ], 
    columns=['chrom', 'ex_start', 'ex_end', 'start', 'end'])
    
    calc_intervals(df)
    

    Result is:

    chrom  ex_start  ex_end
    0      0         100        [(6, 19), (41, 49), (61, 89)]
    1      10        50                            [(16, 50)]
    2      10        50                            [(10, 50)]
    3      10        50                            [(10, 29)]
    4      10        50                                    []
    5      10        50                  [(10, 19), (31, 50)]
    6      1         100       [(1, 49), (61, 69), (81, 100)]
           150       155                         [(150, 154)]
    7      5         50                             [(5, 24)]
    8      0         5                               [(2, 2)]
    9      0         5                               [(5, 5)]
    

    You can join it with your original dataframe if you need this values repeated in every row

    UPD: I have changed the algorithm and excluded assumptions that one chromosome has one ex_start-ex_end diapason and that the (start: end) interval is entirely included in the (ex_start: ex_end) interval. Also, I add more test examples. Check if I use the correct logic of overlapping calculations. Your initial examples are 6 and 7 chrom.

    Also if size of exons are very big, you can code values monotonically to make this algorithm much faster:

    df = pd.DataFrame([
        [0, 3, 1000000000, 50, 60],
        [0, 3, 1000000000, 10, 12],
        [1, 3, 10, 5, 30],
    ], columns=['chrom', 'ex_start', 'ex_end', 'start', 'end'])
    
    
    values = pd.concat([df.ex_start, df.ex_end, df.start, df.end])
    values = pd.concat(
        [values, values + 1, values - 1]
    ).sort_values().unique()
    code = dict(zip(values, np.arange(0, len(values))))
    decode = dict(zip(np.arange(0, len(values)), values))
    
    df_ = df.copy()
    df_ = df_.replace(code)
    
    pd.Series([
        [(decode[i],decode[j]) for (i, j) in t] 
        for t in calc_intervals(df_)])
    # It's much easier to decode inside find_non_overlap function
    

    Results:

    0    [(3, 9), (13, 49), (61, 1000000000)]
    1                                [(3, 4)]