bash awk grep cut

How to exclude lines in a file based on a range of values taken from a second file


I have a file with a list of value ranges:

2    4
6    9
13   14

and a second file that looks like this:

HiC_scaffold_1  1   26
HiC_scaffold_1  2   27
HiC_scaffold_1  3   27
HiC_scaffold_1  4   31
HiC_scaffold_1  5   34
HiC_scaffold_1  6   35
HiC_scaffold_1  7   37
HiC_scaffold_1  8   37
HiC_scaffold_1  9   38
HiC_scaffold_1  10  39
HiC_scaffold_1  11  39
HiC_scaffold_1  12  39
HiC_scaffold_1  13  39
HiC_scaffold_1  14  39
HiC_scaffold_1  15  42

and I would like to exclude rows from file 2 where the value of column 2 falls within a range defined by file 1. The ideal output would be:

HiC_scaffold_1  1   26
HiC_scaffold_1  5   34
HiC_scaffold_1  10  39
HiC_scaffold_1  11  39
HiC_scaffold_1  12  39
HiC_scaffold_1  15  42

I know how to extract a single range with awk:

awk '$2 == "2", $2 == "4"' file2.txt

but my file 1 has many ranges (one per line), and I need to exclude, rather than extract, the rows that fall within them.


Solution

  • This is one way:

    $ awk '
    NR==FNR {                           # first file
        min[NR]=$1                      # store mins and maxes in pairs
        max[NR]=$2
        next
    }
    {                                   # second file
        for(i in min)                   
            if($2>=min[i]&&$2<=max[i])
                next
    }1' ranges data
    

    Output:

    HiC_scaffold_1  1   26
    HiC_scaffold_1  5   34
    HiC_scaffold_1  10  39
    HiC_scaffold_1  11  39
    HiC_scaffold_1  12  39
    HiC_scaffold_1  15  42
    

    If the ranges are small and integer-valued but the data file is huge, you could expand every range into an exclude map up front, so that each data row costs a single hash lookup instead of a loop over all ranges:

    $ awk '
    NR==FNR {                       # ranges file
        for(i=$1;i<=$2;ex[i++]);    # each value in the range goes to exclude hash
        next
    }
    !($2 in ex)' ranges data        # print if not found in ex hash
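
To try both approaches on the sample data from the question, a minimal sketch along these lines may help. The wrapper names `exclude_loop` and `exclude_hash` and the scratch directory are assumptions made for illustration; the awk programs are the two shown above, lightly reformatted. Each function takes the ranges file as its first argument and the data file as its second.

```shell
#!/bin/sh
# exclude_loop and exclude_hash are hypothetical wrapper names, not part
# of the original answer; they only package the two awk programs above.

# Approach 1: store min/max pairs, then test every data row against each pair.
exclude_loop() {
    awk '
    NR==FNR { min[NR]=$1; max[NR]=$2; next }                 # ranges file
    { for (i in min) if ($2>=min[i] && $2<=max[i]) next }    # skip row if in any range
    1' "$1" "$2"
}

# Approach 2: expand each integer range into an exclude hash,
# then keep only rows whose column 2 is not a key of that hash.
exclude_hash() {
    awk '
    NR==FNR { for (i=$1; i<=$2; i++) ex[i]; next }           # ranges file
    !($2 in ex)' "$1" "$2"
}

# Recreate the sample inputs from the question in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' '2 4' '6 9' '13 14' > "$tmp/ranges"
i=1
for v in 26 27 27 31 34 35 37 37 38 39 39 39 39 39 42; do
    printf 'HiC_scaffold_1\t%d\t%d\n' "$i" "$v"
    i=$((i + 1))
done > "$tmp/data"

exclude_loop "$tmp/ranges" "$tmp/data" > "$tmp/out_loop"
exclude_hash "$tmp/ranges" "$tmp/data" > "$tmp/out_hash"

# cmp -s is silent; the result is printed only if both approaches agree.
cmp -s "$tmp/out_loop" "$tmp/out_hash" && cat "$tmp/out_hash"
rm -rf "$tmp"
```

On the sample data, both functions should keep the rows whose second column is 1, 5, 10, 11, 12 or 15. Note that the hash approach assumes integer range bounds; the loop approach also works for non-integer values.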