How to extract a run of symbols (<<<) and its corresponding letters from a string with sed, awk or grep

Input data: DNA covariance model, all in a single file

Each line is a header followed by either the sequence or the covariance annotation:

NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGGA
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGACtACAG
NC_013791.2.5 : >>>>..<<<<......

Expected output for all of the above IDs, from the same single file:

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<

I am able to delete the last character with: sed 's/.$//' (as suggested on Stack Overflow),
extract the last characters with: rev sym.txt | cut -c 1-3 | rev
and extract only < with grep: grep -Eo "<.{3}" sym.txt

but I am not able to extract the following:

GAG
<<<
GAGC
<<<<

or GAGC <<<<

Could someone help with sed, awk or grep? Thank you in advance.

Solution

If your data is always in this format, you can print the first two fields followed by a call to substr, which returns the part of interest.

Based on the answer provided by @stuffy, you could change the code to match a < character three or more times:

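awk 'match($0, /<<<+/) {
  # prev holds the sequence line; reuse the match position on it
  print $1, $2, substr(prev, RSTART, RLENGTH)
  print $1, $2, substr($0, RSTART, RLENGTH)
} {
  prev = $0   # remember this line for the next record
}' file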

Here, $0 is the current line and prev is the previous line.

The match function sets the predefined variables RSTART and RLENGTH (the start position and the length of the match), which you can pass to substr.
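To see what match actually sets, here is a minimal sketch that runs it on a single line from the sample data and prints RSTART, RLENGTH and the matched text. The variable names line and result are only illustrative.

```shell
# One line copied from the sample input (illustrative variable name).
line='NC_013791.2.3 : >>>>..<<<<......'

# match() returns the 1-based start of the first match (0 if there is none)
# and sets RSTART and RLENGTH as a side effect.
result=$(printf '%s\n' "$line" |
  awk 'match($0, /<<<+/) { print RSTART, RLENGTH, substr($0, RSTART, RLENGTH) }')

echo "$result"   # prints: 23 4 <<<<
```

The positions count from the start of the whole line, ID included, which is why the same RSTART can be reused on the previous line as long as both lines carry an ID of the same length.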

Output

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<

If, for example, the field separator is " : " and you also want to check that the ID before the separator is the same on both lines:

awk -F" : " '
  match($2, /<<<+/) && key == $1 {
    print $1 FS substr(val, RSTART, RLENGTH)
    print $1 FS substr($2, RSTART, RLENGTH)
  }
  { val = $2; key = $1 }
' file
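The question also mentions a one-line form (letters followed by the symbols). A possible sketch of that variant, assuming the same " : " separator and the same sequence-then-annotation line order; the temporary file and the variable names tmp and oneline are only there to keep the example self-contained.

```shell
# Write part of the sample data to a temporary file (illustrative path).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
EOF

# Same logic as above, but both substrings go on one output line,
# separated by OFS (a single space by default).
oneline=$(awk -F" : " '
  match($2, /<<<+/) && key == $1 {
    print $1 FS substr(val, RSTART, RLENGTH), substr($2, RSTART, RLENGTH)
  }
  { val = $2; key = $1 }
' "$tmp")
rm -f "$tmp"

echo "$oneline"
```

For the four sample lines this should print "NC_013791.2.2 : GAG <<<" and "NC_013791.2.3 : CTGG <<<<". Drop the leading $1 FS if only the letters and symbols are wanted.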