Input data: DNA covariance model, all records in a single file
Each record is two lines with the same header: header : sequence, then header : covariance structure
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGGA
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGACtACAG
NC_013791.2.5 : >>>>..<<<<......
Expected output for all of the above IDs, from the same single file:
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<
I am able to delete the last character with: sed 's/.$//'
as suggested on Stack Overflow,
extract the last characters with: rev sym.txt | cut -c 1-3 | rev
and extract only the < characters with grep: grep -Eo "<.{3}" sym.txt
but I am not able to extract the output as shown below:
GAG
<<<
GAGC
<<<<
or GAGC <<<<
Could someone help with sed, awk or grep? Thank you in advance.
If your data is always in this format, you can print the first 2 fields followed by a call to substr, which prints the part of interest.
Based on the answer provided by @stuffy, you could change the code to match a < character 3 or more times:
awk 'match($0, /<<<+/) {
    print $1, $2, substr(prev, RSTART, RLENGTH)
    print $1, $2, substr($0, RSTART, RLENGTH)
}
{
    prev = $0
}' file
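Here, $0 is the current line and prev is the previous line.
The match function sets the predefined variables RSTART and RLENGTH, which you can then pass to substr. Because both lines of a record share the same header and have the same length, the offsets found on the structure line also select the corresponding bases on the sequence line.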
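To see what match records, here is a minimal sketch that runs it on one structure line copied from the question and prints the two variables. The shell variable names line and result are only for illustration.

```shell
# One structure line taken from the question's input
line='NC_013791.2.2 : >>>>.........<<<'

# match() returns the 1-based start of the first run of 3 or more "<"
# and stores it in RSTART; the length of that run goes into RLENGTH
result=$(printf '%s\n' "$line" | awk 'match($0, /<<<+/) { print RSTART, RLENGTH }')
echo "$result"
```

The header and separator take up the first 16 characters, so the run of < starts at column 30 of the whole line and is 3 characters long. The same two numbers, applied to the previous line, pick out GAG.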
Output
NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<
If, for example, the field separator is " : " and you want to check that the header before it is the same on both lines:
awk -F" : " '
match($2, /<<<+/) && key == $1 {
    print $1 FS substr(val, RSTART, RLENGTH)
    print $1 FS substr($2, RSTART, RLENGTH)
}
{ val = $2; key = $1 }
' file
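The question also mentions a single-line form (sequence slice and structure slice side by side). A possible variation of the command above, sketched here on two records copied from the question, joins both slices in one print. The temporary file and the variable names sample and result are assumptions made for this example.

```shell
# Two records copied from the question, written to a temporary file
sample=$(mktemp)
cat > "$sample" <<'EOF'
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
EOF

result=$(awk -F' : ' '
match($2, /<<<+/) && key == $1 {
    # header, then the sequence slice and the structure slice on one line
    print $1 FS substr(val, RSTART, RLENGTH), substr($2, RSTART, RLENGTH)
}
{ val = $2; key = $1 }   # remember the previous header and field
' "$sample")
echo "$result"
rm -f "$sample"
```

With this input it prints "NC_013791.2.2 : GAG <<<" and "NC_013791.2.3 : CTGG <<<<". The comma in print inserts the output field separator (a space by default) between the two slices.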