
Sum of a block of duplicated rows; if not duplicated, add 1 and calculate the difference


I need to parse several sorted files that have the following structure:

1_0.91  10
1_0.91  20
1_0.91  30
1_0.91  40
2_0.89  50
1_0.91  60
1_0.91  70
1_0.91  80
2_0.89  90
2_0.89  100
2_0.89  110
3_0.86  120

The first column is a feature in the genome, and the second column is its location. Features are interspersed with one another. I want to find the size of each feature, or rather of each block of a feature. For this example, my desired output is the following:

1_0.91  10  40  30
2_0.89  50  51  1
1_0.91  60  80  20
2_0.89  90  110 20
3_0.86  120 121 1

The first block of feature 1_0.91 starts at 10 and is also found at locations 20, 30, and 40. I want to add columns with the start and end of each block: in this case, it starts at 10 and ends at 40. I then want the size of the block in another column (end minus start, here 30). There are several places where a feature appears only once; in my example, the first 2_0.89 sits between two blocks of feature 1_0.91. In that case, I want to set the end to the location plus 1 and compute the size the same way, which then equals 1.

I have tried to use awk, but I am stuck with the features that appear only once.

Here is what I have used so far; it is a bit convoluted. Let's call the first file file1.txt:

cat file1.txt | awk '$1!=prev{if (pline){print pline;}print;}{prev=$1;pline=$0;}END{print pline;}' > file2.txt

The output:

1_0.91  10
1_0.91  40
2_0.89  50
2_0.89  50
1_0.91  60
1_0.91  80
2_0.89  90
2_0.89  110
3_0.86  120
3_0.86  120
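
Since the one-liner is hard to read, here is the same logic spread out with comments (equivalent to the command above, with the cat dropped):

awk '
$1 != prev {          # first field changed: a new feature block starts here
    if (pline) {      # if a previous line is buffered, print it (end of the old block)
        print pline
    }
    print             # print the first line of the new block (its start)
}
{
    prev = $1         # remember the current feature
    pline = $0        # buffer the current line as the latest line of its block
}
END {
    print pline       # print the last buffered line (end of the final block)
}' file1.txt > file2.txt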

Now I extract the odd and the even lines with sed and use paste to join them side by side:

paste <(cat file2.txt | sed 'n; d') <(cat file2.txt | sed '1d; n; d' ) | awk 'BEGIN{OFS="\t"} {print $1,$2,$4}' > file3.txt

The output:

1_0.91  10  40
2_0.89  50  50
1_0.91  60  80
2_0.89  90  110
3_0.86  120 120

Next, I estimate the size of each feature:

cat file3.txt | awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$3-$2}' > file4.txt

The output:

1_0.91  10  40  30
2_0.89  50  50  0
1_0.91  60  80  20
2_0.89  90  110 20
3_0.86  120 120 0

Next, I replace zeros in column 4 with 1:

cat file4.txt | awk 'BEGIN{OFS="\t"} { $4 = ($4 == "0" ? 1 : $4) } 1' > file5.txt

The output:

1_0.91  10  40  30
2_0.89  50  50  1
1_0.91  60  80  20
2_0.89  90  110 20
3_0.86  120 120 1

Finally, I fix the end of each feature with awk:

cat file5.txt | awk 'BEGIN{OFS="\t"} { $3 = ($2 == $3 ? $3+1 : $3) } 1' > file6.txt

The output:

1_0.91  10  40  30
2_0.89  50  51  1
1_0.91  60  80  20
2_0.89  90  110 20
3_0.86  120 121 1

I wonder if there is a faster and easier way to do this. Thank you.


Solution

  • Assumptions:

    • consecutive lines with the same feature (field #1) are sorted by location (field #2) in ascending order
    • input/output field delimiters are \t
    • location (field #2) values are always positive integers (otherwise we could tweak the code)

    One awk idea:

    awk '
    function print_feature() {
        if ( feature != "" )
           print feature,min,max,(max-min)
    }
    
    BEGIN         { FS=OFS="\t" }
    $1 != feature { print_feature()          # row contains a new/different feature, so print previous feature details
                    feature=$1
                    min=$2
                    max=min+1                # end defaults to start+1 so a single-occurrence feature gets size 1
                    next
                  }
                  { max=$2 }                 # row contains a repeated/duplicate feature
    END           { print_feature() }        # flush last feature details to stdout
    ' feature.dat
    

    This generates:

    1_0.91  10      40      30
    2_0.89  50      51      1
    1_0.91  60      80      20
    2_0.89  90      110     20
    3_0.86  120     121     1
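
    Since the question mentions several sorted files, the same program can be run once per file, for example with a small shell loop. A minimal sketch, assuming the awk program above is saved as feature_sizes.awk (the script name, glob, and output naming are placeholders to adapt):

    # run the awk program once per sorted input file,
    # writing one <name>.sizes.txt result file per input
    for f in file*.txt; do
        awk -f feature_sizes.awk "$f" > "${f%.txt}.sizes.txt"
    done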