Search code examples
awksedtext-manipulationfastq

Extracting ID and sequence from a FASTQ file


I'm trying to manipulate a Fastq file. It looks like this:

@HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B@=4A9@:6@96:1??9;>##########
@HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA@B7BB:7?@####################
@HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=@CB@?BA/@BA;6>BBA8A6A<?A4?B=
...
...
...
@HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B@BA@AB@B:74:;8=>7

My expected output is:

@HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
@HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
@HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC

So, the ID line are those starting with @HWUSI (i.e @HWUSI-EAS610:1:1:7:1951#0/1).. After each ID there is a line with its sequence. Now, I would like to obtain a file only with each ID and its correspondig sequence and the sequence should be reverse and complement. (A=T, T=A, C=G, G=C) With Sed I can obtain all the sequence reverse and complementary with the command

sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC

How can I obtain also the corresponding ID?


Solution

  • With sed:

    sed -n '/@HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
    

    This works as follows:

    /@HWUSI/ {                     # If a line starts with @HWUSI
      p                            # print it
      s/.*//                       # empty the pattern space
      N                            # fetch the sequence line. It is now preceded
                                   # by a newline in the pattern space. That is
                                   # going to be our cursor
      :a                           # jump label for looping
      /\n$/! {                     # while the cursor has not arrived at the end
        s/\n\(.*\)\(.\)/\2\n\1/    # move the last character before the cursor
        ba                         # go back to a. This loop reverses the sequence
      }
      y/ATCG/TAGC/                 # then invert it
      p                            # and print it.
    }
    

    I intentionally left the newline in there for more readable spacing; if that is not desired, replace the last p with a P (upper case instead of lower case). Where p prints the whole pattern space, P only prints the stuff before the first newline.