I need to parse several sorted files that have the following structure:
1_0.91 10
1_0.91 20
1_0.91 30
1_0.91 40
2_0.89 50
1_0.91 60
1_0.91 70
1_0.91 80
2_0.89 90
2_0.89 100
2_0.89 110
3_0.86 120
The first column is a feature in the genome, and the second column is its location. Blocks of one feature are interleaved with blocks of others. I want to find the size of each block of each feature. For this example, my desired output is the following:
1_0.91 10 40 30
2_0.89 50 51 1
1_0.91 60 80 20
2_0.89 90 110 20
3_0.86 120 121 1
The feature 1_0.91 starts at 10 and is also found at locations 20, 30, and 40. I want to create new columns with the start and the end; in this case it starts at 10 and ends at 40. Then I output the size in another column (end minus start, here 30). There are several places where a feature appears only once; in my example, a single 2_0.89 line sits between blocks of feature 1_0.91. In that case I want the end to be the location plus 1 and compute the size the same way, which here equals 1.
I have tried awk, but I am stuck on the features that appear only once.
Here is what I have so far; it is a bit convoluted. Let's call the first file file1.txt:
awk '$1!=prev{if (pline){print pline;}print;}{prev=$1;pline=$0;}END{print pline;}' file1.txt > file2.txt
The output:
1_0.91 10
1_0.91 40
2_0.89 50
2_0.89 50
1_0.91 60
1_0.91 80
2_0.89 90
2_0.89 110
3_0.86 120
3_0.86 120
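That one-liner is dense; expanded with comments, it does the following (behavior is identical):
awk '
$1 != prev {                  # feature changed relative to the previous row
    if (pline) print pline    # emit the last line of the previous block
    print                     # emit the first line of the new block
}
{ prev = $1; pline = $0 }     # remember the current feature and the full line
END { print pline }           # emit the last line of the final block
' file1.txt > file2.txt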
Now I extract the odd and even lines with sed and use paste to join them side by side:
paste <(sed 'n; d' file2.txt) <(sed '1d; n; d' file2.txt) | awk 'BEGIN{OFS="\t"} {print $1,$2,$4}' > file3.txt
The output:
1_0.91 10 40
2_0.89 50 50
1_0.91 60 80
2_0.89 90 110
3_0.86 120 120
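As an aside, sed 'n; d' keeps the odd-numbered lines and sed '1d; n; d' keeps the even-numbered ones; the same split can be done with awk alone, if a single tool is preferred:
paste <(awk 'NR % 2 == 1' file2.txt) <(awk 'NR % 2 == 0' file2.txt) | awk 'BEGIN{OFS="\t"} {print $1,$2,$4}' > file3.txt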
Next, I compute the size of each block:
awk 'BEGIN{OFS="\t"} {print $1,$2,$3,$3-$2}' file3.txt > file4.txt
The output:
1_0.91 10 40 30
2_0.89 50 50 0
1_0.91 60 80 20
2_0.89 90 110 20
3_0.86 120 120 0
Next, I replace zeros in column 4 with 1:
awk 'BEGIN{OFS="\t"} { $4 = ($4 == "0" ? 1 : $4) } 1' file4.txt > file5.txt
The output:
1_0.91 10 40 30
2_0.89 50 50 1
1_0.91 60 80 20
2_0.89 90 110 20
3_0.86 120 120 1
Finally, I fix the end coordinate of the single-occurrence features (where start equals end) with awk:
awk 'BEGIN{OFS="\t"} { $3 = ($2 == $3 ? $3+1 : $3) } 1' file5.txt > file6.txt
The output:
1_0.91 10 40 30
2_0.89 50 51 1
1_0.91 60 80 20
2_0.89 90 110 20
3_0.86 120 121 1
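In hindsight, the last three passes can be folded into a single awk call with the same logic:
awk 'BEGIN{OFS="\t"} {
    size = $3 - $2                         # end minus start
    if (size == 0) { size = 1; $3 += 1 }   # single occurrence: bump the end, size becomes 1
    print $1, $2, $3, size
}' file3.txt > file6.txt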
Still, I wonder if there is a faster and easier way to do it. Thank you.
Assumptions:
- input fields are tab-delimited (\t)
One awk idea:
awk '
function print_feature() {
    if (feature != "")
        print feature, min, max, (max - min)
}
BEGIN { FS = OFS = "\t" }
$1 != feature {          # row contains a new/different feature, so print previous feature details
    print_feature()
    feature = $1
    min = $2
    max = min + 1        # assume a singleton for now: end defaults to start+1
    next
}
{ max = $2 }             # row contains a repeated/duplicate feature
END { print_feature() }  # flush last feature details to stdout
' feature.dat
This generates:
1_0.91 10 40 30
2_0.89 50 51 1
1_0.91 60 80 20
2_0.89 90 110 20
3_0.86 120 121 1
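The question mentions several files; one way to handle that is a per-file loop (a sketch; the script name feature_sizes.awk and the glob are assumptions):
for f in feature*.dat; do
    awk -f feature_sizes.awk "$f" > "${f%.dat}_sizes.tsv"   # one output per input file
done
Here feature_sizes.awk is assumed to hold the program between the single quotes above. Feeding all files to a single awk invocation would also work, but the feature/min/max state would then need to be flushed at each file boundary (e.g., on FNR == 1), so the per-file loop keeps things simpler.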