
How to extract the < symbols and their corresponding letters from a string with sed, awk or grep


Input data (DNA covariance model, all in a single file)

Each record is two lines: header : sequence, then header : covariance

NC_013791.2.2 : GCTCAGCTGGCtAGAG
NC_013791.2.2 : >>>>.........<<<
NC_013791.2.3 : GCTCAGCTGGCtAGAG
NC_013791.2.3 : >>>>..<<<<......
NC_013791.2.4 : GCTCAGCTGGCtAGGA
NC_013791.2.4 : >>>>.........<<<
NC_013791.2.5 : GCTCAGCTGACtACAG
NC_013791.2.5 : >>>>..<<<<......

Expected output for all of the above IDs, from that single file:

NC_013791.2.2 : GAG
NC_013791.2.2 : <<<
NC_013791.2.3 : CTGG
NC_013791.2.3 : <<<<
NC_013791.2.4 : GGA
NC_013791.2.4 : <<<
NC_013791.2.5 : CTGA
NC_013791.2.5 : <<<<
  1. I am able to delete the last character with sed 's/.$//', as suggested on Stack Overflow.

  2. I can extract the last characters with: rev sym.txt | cut -c 1-3 | rev

  3. I can extract only the < characters with grep: grep -Eo "<.{3}" sym.txt

But I am not able to extract output like this:

GAG
<<<
GAGC
<<<<

or GAGC <<<<

Could someone help with sed, awk or grep? Thank you in advance.


Solution

  • If your data is always in this format, you can print the first two fields followed by a call to substr, which returns the part of interest.

    Based on the answer provided by @stuffy, you could change the code to match a < character three or more times:

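    awk 'match($0, /<<<+/) {                       # a run of 3+ "<" on the current line
      print $1, $2, substr(prev, RSTART, RLENGTH)  # same columns of the previous (sequence) line
      print $1, $2, substr($0, RSTART, RLENGTH)    # the matched "<" run itself
    } { 
      prev = $0                                    # remember this line for the next record
    }' file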
    

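    Here, $0 is the current line and prev is the previous line. Because the sequence line and the symbol line of a pair have the same prefix and the same length, the position of the match in the symbol line also points at the corresponding letters in the sequence line.

    The match function sets the predefined variables RSTART (the start position of the match) and RLENGTH (its length), which you can pass to substr.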

    Output

    NC_013791.2.2 : GAG
    NC_013791.2.2 : <<<
    NC_013791.2.3 : CTGG
    NC_013791.2.3 : <<<<
    NC_013791.2.4 : GGA
    NC_013791.2.4 : <<<
    NC_013791.2.5 : CTGA
    NC_013791.2.5 : <<<<
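
To see what match actually sets, here is a minimal sketch on one pair taken from the question. The shell variable names seq, sym and out are only illustrative, and the two-line input stands in for two consecutive lines of the file.

```shell
# One sequence/symbol pair from the question (variable names are hypothetical).
seq='NC_013791.2.3 : GCTCAGCTGGCtAGAG'
sym='NC_013791.2.3 : >>>>..<<<<......'

out=$(printf '%s\n%s\n' "$seq" "$sym" | awk '
  # match() sets RSTART (1-based start of the match) and RLENGTH (its length).
  match($0, /<<<+/) {
    print "RSTART=" RSTART, "RLENGTH=" RLENGTH
    # Reuse the same offsets on the previous line to get the letters.
    print substr(prev, RSTART, RLENGTH)
  }
  { prev = $0 }
')
printf '%s\n' "$out"
```

For this pair the run of < starts at character 23 of the whole line and is 4 characters long, so the sketch prints RSTART=23 RLENGTH=4 followed by CTGG.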
    

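    If, for example, the field separator is " : " and you want to check that the ID before it is the same on both lines, you can match against the second field only. RSTART is then relative to that field, so it is applied to the second field of the previous line: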

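    awk -F" : " '
      # symbol line with a run of 3+ "<" whose ID equals the ID of the previous line
      match($2, /<<<+/) && key == $1 {
        print $1 FS substr(val, RSTART, RLENGTH)   # matching columns of the sequence
        print $1 FS substr($2, RSTART, RLENGTH)    # the "<" run itself
      }
      { val = $2; key = $1 }                       # remember field 2 and the ID
    ' file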
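
The question also mentions a one-line form (letters and symbols side by side). A possible variation of the command above, as a sketch only: the function name extract_pairs is hypothetical, and it assumes the same " : " separator and the same two-lines-per-ID layout as the sample data.

```shell
# Hypothetical helper: prints "ID : LETTERS SYMBOLS" on one line per pair.
# Reads the files given as arguments, or standard input if there are none.
extract_pairs() {
  awk -F' : ' '
    # Symbol line with 3+ "<" and the same ID as the previous line.
    match($2, /<<<+/) && key == $1 {
      print $1 FS substr(val, RSTART, RLENGTH), substr($2, RSTART, RLENGTH)
    }
    { val = $2; key = $1 }
  ' "$@"
}

# Usage with the first two pairs from the question.
out=$(printf '%s\n' \
  'NC_013791.2.2 : GCTCAGCTGGCtAGAG' \
  'NC_013791.2.2 : >>>>.........<<<' \
  'NC_013791.2.3 : GCTCAGCTGGCtAGAG' \
  'NC_013791.2.3 : >>>>..<<<<......' | extract_pairs)
printf '%s\n' "$out"
```

With these four lines the sketch prints NC_013791.2.2 : GAG <<< and NC_013791.2.3 : CTGG <<<<. On a real file it would be called as extract_pairs file.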