I'm trying to manipulate a Fastq file. It looks like this:
@HWUSI-EAS610:1:1:3:1131#0/1
GATGCTAAGCCCCTAAGGTCATAAGACTGNNANGTC
+
B<ABA<;B@=4A9@:6@96:1??9;>##########
@HWUSI-EAS610:1:1:3:888#0/1
GATAGGACCAAACATCTAACATCTTCCCGNNGNTTC
+
B9>>ABA@B7BB:7?@####################
@HWUSI-EAS610:1:1:4:941#0/1
GCTTAGGAAGGAAGGAAGGAAGGGGTGTTCTGTAGT
+
BBBB:CB=@CB@?BA/@BA;6>BBA8A6A<?A4?B=
...
...
...
@HWUSI-EAS610:1:1:7:1951#0/1
TGATAGATAAGTGCCTACCTGCTTACGTTACTCTCC
+
BB=A6A9>BBB9B;B:B?B@BA@AB@B:74:;8=>7
My expected output is:
@HWUSI-EAS610:1:1:3:1131#0/1
GACNTNNCAGTCTTATGACCTTAGGGGCTTAGCATC
@HWUSI-EAS610:1:1:3:888#0/1
GAANCNNCGGGAAGATGTTAGATGTTTGGTCCTATC
@HWUSI-EAS610:1:1:4:941#0/1
ACTACAGAACACCCCTTCCTTCCTTCCTTCCTAAGC
So, the ID line are those starting with @HWUSI (i.e @HWUSI-EAS610:1:1:7:1951#0/1).. After each ID there is a line with its sequence. Now, I would like to obtain a file only with each ID and its correspondig sequence and the sequence should be reverse and complement. (A=T, T=A, C=G, G=C) With Sed I can obtain all the sequence reverse and complementary with the command
sed -n '2~4p' MYFILE.fq | rev | tr ATCG TAGC
How can I obtain also the corresponding ID?
With sed:
sed -n '/@HWUSI/ { p; s/.*//; N; :a /\n$/! { s/\n\(.*\)\(.\)/\2\n\1/; ba }; y/ATCG/TAGC/; p }' filename
This works as follows:
/@HWUSI/ { # If a line starts with @HWUSI
p # print it
s/.*// # empty the pattern space
N # fetch the sequence line. It is now preceded
# by a newline in the pattern space. That is
# going to be our cursor
:a # jump label for looping
/\n$/! { # while the cursor has not arrived at the end
s/\n\(.*\)\(.\)/\2\n\1/ # move the last character before the cursor
ba # go back to a. This loop reverses the sequence
}
y/ATCG/TAGC/ # then invert it
p # and print it.
}
I intentionally left the newline in there for more readable spacing; if that is not desired, replace the last p
with a P
(upper case instead of lower case). Where p
prints the whole pattern space, P
only prints the stuff before the first newline.