Tags: linux, bash, awk, range, genetics

How to compare 2 lists of ranges in bash?


Using a bash script (Ubuntu 16.04), I'm trying to compare 2 lists of ranges: does any number in any of the ranges in file1 coincide with any number in any of the ranges in file2? If so, print the row from the second file. Each range is given as 2 tab-delimited columns (in file1, row 1 represents the range 1-4, i.e. 1, 2, 3, 4). The real files are quite big.

file1:

1 4
5 7 
8 11
12 15

file2:

3 4 
8 13 
20 24

Desired output:

3 4 
8 13

My best attempt has been:

awk 'NR=FNR { x[$1] = $1+0; y[$2] = $2+0; next}; 
{for (i in x) {if (x[i] > $1+0); then
{for (i in y) {if (y[i] <$2+0); then            
{print $1, $2}}}}}' file1 file2 > output.txt

This returns an empty file.

I'm thinking that the script will need to combine range comparisons using if-then conditions with iteration through each line of both files. I've found examples of each concept, but can't figure out how to combine them.

Any help appreciated!


Solution

  • It depends on how big your files are, of course. If they are not big enough to exhaust the memory, you can try this 100% bash solution:

    declare -a min=() # array of lower bounds of ranges
    declare -a max=() # array of upper bounds of ranges
    
    # read ranges in second file, store them in arrays min and max
    while read a b; do
        min+=( "$a" );
        max+=( "$b" );
    done < file2
    
    # read ranges in first file    
    while read a b; do
        # loop over indexes of min (and max) array
        for i in "${!min[@]}"; do
            if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
                echo "${min[i]} ${max[i]}" # print range
                unset min[i] max[i]        # performance optimization
            fi
        done
    done < file1
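
One way to try this out is to save the script under a hypothetical name such as `overlap.sh` and feed it the sample files from the question:

```shell
# Save the bash solution above as overlap.sh (hypothetical file name),
# recreate the question's sample data, then run it.
cat > overlap.sh <<'EOF'
declare -a min=() # array of lower bounds of ranges
declare -a max=() # array of upper bounds of ranges

while read a b; do          # store the ranges of file2
    min+=( "$a" )
    max+=( "$b" )
done < file2

while read a b; do          # scan the ranges of file1
    for i in "${!min[@]}"; do
        if (( max[i] >= a && min[i] <= b )); then # if ranges overlap
            echo "${min[i]} ${max[i]}"
            unset min[i] max[i]
        fi
    done
done < file1
EOF

printf '1\t4\n5\t7\n8\t11\n12\t15\n' > file1
printf '3\t4\n8\t13\n20\t24\n' > file2

bash overlap.sh > output.txt
cat output.txt
# 3 4
# 8 13
```

Note that `read a b` splits on any whitespace, so both tab- and space-delimited input work unchanged.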
    

    This is just a starting point. There are many possible performance / memory footprint improvements. But they strongly depend on the sizes of your files and on the distributions of your ranges.
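
As one illustration of such an improvement (a sketch under an extra assumption, not part of the original answer): when the ranges inside each file are sorted by lower bound and mutually non-overlapping, as in the sample data, the quadratic nested loop can be replaced by a linear two-pointer sweep over both files:

```shell
# Hypothetical linear-time variant. Assumes the ranges inside each file
# are sorted by lower bound and mutually non-overlapping (true of the
# sample data); if not, sort each file first with: sort -n -k1,1
printf '1\t4\n5\t7\n8\t11\n12\t15\n' > file1
printf '3\t4\n8\t13\n20\t24\n' > file2

exec 3< file1 4< file2
read a1 b1 <&3
read a2 b2 <&4
while :; do
    if   (( b1 < a2 )); then read a1 b1 <&3 || break # file1 range is behind: advance it
    elif (( b2 < a1 )); then read a2 b2 <&4 || break # file2 range is behind: advance it
    else                                             # overlap: print the file2 range once
        echo "$a2 $b2"
        read a2 b2 <&4 || break
    fi
done > sweep.txt
exec 3<&- 4<&-
cat sweep.txt
# 3 4
# 8 13
```

Each iteration advances one of the two file descriptors, so the sweep reads every range exactly once instead of comparing every pair.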

    EDIT 1: improved the range overlap test.

    EDIT 2: reused the excellent optimization proposed by RomanPerekhrest (unset already printed ranges from file2). The performance should be better when the probability that ranges overlap is high.

    EDIT 3: performance comparison with the awk version proposed by RomanPerekhrest (after fixing the initial small bugs): awk is between 10 and 20 times faster than bash on this problem. If performance is important and you hesitate between awk and bash, prefer:

    awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
        { for (i in a)
              if ($1 <= b[i] && a[i] <= $2) {
                  print a[i], b[i]; delete a[i]; delete b[i];
              } 
        }' file2 file1
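
As a quick sanity check, running that one-liner against the sample files from the question reproduces the desired output (note that file2 must be listed first, since the `NR == FNR` block stores its ranges):

```shell
# Recreate the sample data and run the awk version (file2 first).
printf '1\t4\n5\t7\n8\t11\n12\t15\n' > file1
printf '3\t4\n8\t13\n20\t24\n' > file2

awk 'NR == FNR { a[FNR] = $1; b[FNR] = $2; next; }
    { for (i in a)
          if ($1 <= b[i] && a[i] <= $2) {
              print a[i], b[i]; delete a[i]; delete b[i];
          }
    }' file2 file1 > output.txt
cat output.txt
# 3 4
# 8 13
```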