AWK: merge two files based on the overlapping range given in the files


Let me explain my problem using a dummy example. This is file A -

1 10 20 aa
2 30 40 bb
3 60 70 cc
. .. .. ..

and this is file B -

10 15 xx yy mm
21 29 mm nn ss
11 18 rr tt yy
69 90 qq ww ee
.. .. .. .. ..

I am trying to merge files A and B so that a row of A is joined with a row of B whenever their ranges overlap.

In my case, a row of A overlaps a row of B when the range from $2 to $3 in A's row has something in common with the range from $1 to $2 in B's row. In the example above, range(10,20) and range(10,15) overlap. Here range(10,20) = [10,11,12,13,14,15,16,17,18,19] and range(10,15) = [10,11,12,13,14], so the end value of each range is excluded.

So the expected output is -

1 10 20 aa 10 15 xx
1 10 20 aa 11 18 rr
3 60 70 cc 69 90 qq

I tried it this way (using Python and awk):

    for peak in State.peaks:
        i = peak[-1]
        peak = peak[:-1]
        a = peak[1]
        b = peak[2]
        d = State.delta
        c = ''' awk '{id=%d;delta=%d;a=%d;b=%d;x=%s;y=%s;if((x<=a&&y>a)||(x<=b&&y>b) || (x>a&&y<=b)) print id" "$7" "$3-$2} ' %s > %s ''' % (i, d, a, b, "$2-d", "$3+d", State.fourD, "file"+str(name))
        os.system(c)

I want to remove the Python part completely, as it takes too much time.


Solution

  • This Awk script does the job:

    NR == FNR { record[NR] = $0; lo[NR] = $2; hi[NR] = $3; nrecs = NR; next }
    NR != FNR { # Overlap:  lo[A] < hi[B] && lo[B] < hi[A]
                for (i = 1; i <= nrecs; i++)
                {
                    if (lo[i] < $2 && $1 < hi[i])
                        print record[i], $1, $2, $3
                }
              }
    

    I saved it as range-merge-53.awk (53 is simply a random double-digit prime). I created file.A and file.B from your sample data, and ran:

    $ awk -f range-merge-53.awk file.A file.B
    1 10 20 aa 10 15 xx
    1 10 20 aa 11 18 rr
    3 60 70 cc 69 90 qq
    $
    

    The key is the 'overlap' condition, which must exclude the high value of each range. Such a range is often written [lo..hi) and called half-open: the low end is included and the high end is not.
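
To see the overlap condition in isolation, here is a minimal sketch. The shell function name `overlaps` and the second test range (20 to 30) are made up for illustration; the other ranges come from the sample data. It applies the same test as the script, lo[A] < hi[B] && lo[B] < hi[A], to one pair of ranges at a time.

```shell
# Hypothetical helper: overlaps loA hiA loB hiB
# Prints "yes" if the half-open ranges [loA..hiA) and [loB..hiB) overlap, else "no".
overlaps() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" \
        'BEGIN { print ((a < d && c < b) ? "yes" : "no") }'
}

overlaps 10 20 10 15   # prints yes: both ranges contain 10..14
overlaps 10 20 20 30   # prints no: 20 is excluded from the first range
overlaps 60 70 69 90   # prints yes: both ranges contain 69
```

Because the high value is excluded, two ranges that merely touch (one ends where the other starts) are not reported as overlapping.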

    It would be possible to omit either the next or the NR != FNR condition (but not both) and the code would work as well.
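
As an illustration of that remark, here is a sketch of the variant that keeps next and drops the NR != FNR condition. The temporary directory and the variable names `tmp` and `merged` are assumptions added only to make the example self-contained; the data is the sample from the question without the placeholder rows of dots.

```shell
# Recreate the sample data in a scratch directory (hypothetical location).
tmp=$(mktemp -d)
printf '%s\n' '1 10 20 aa' '2 30 40 bb' '3 60 70 cc' > "$tmp/file.A"
printf '%s\n' '10 15 xx yy mm' '21 29 mm nn ss' '11 18 rr tt yy' '69 90 qq ww ee' > "$tmp/file.B"

merged=$(awk '
    # First file: remember each row and its range, then skip the second block.
    NR == FNR { record[NR] = $0; lo[NR] = $2; hi[NR] = $3; nrecs = NR; next }
    # Second file: only reached for file.B rows because of the "next" above.
    {
        for (i = 1; i <= nrecs; i++)
            if (lo[i] < $2 && $1 < hi[i])
                print record[i], $1, $2, $3
    }
' "$tmp/file.A" "$tmp/file.B")

printf '%s\n' "$merged"
rm -rf "$tmp"
```

With the sample data this prints the same three lines as the original script. The next statement is what keeps rows of file.A out of the second block, which is why next and the condition cannot both be removed.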

    See also Determine whether two date ranges overlap — the logic of ranges applies to dates and integers and floating point, etc.