I have a fastq file in which the first 8 lines look like this:
@SRR21388627.2845086/1
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF
@SRR21388627.2707233/1
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFF:FF,:FFFF,FF,F:FFFFF:,F,,:,FF:,:,FFF:::F:,,FFF:::,FF:::
I would like to take the DNA sequence and save each sequence as a new file named with the line before the sequence, such as "SRR21388627.2845086.1.fq", where the @
is removed, and the /
is replaced with .
So far I came up with a command with reference from this post, which doesn't work yet, because I am not sure how to remove @
and replace /
in awk
cat deltaQ_1_region_1.fq | paste - - - - | cut -f1,2 |
awk -F'\t' '$1!=prev{close(out); out=$1".fq"; prev=$1} {sub(/[^\t]+\t/,""); print > out}' file
The expected output for SRR21388627.2845086.1.fq is:
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
You have 4 lines per "record" so you can use a modulo 4 to differentiate the "fields" in awk
(instead of joining them with a paste
before calling awk):
awk '
NR%4 == 1 { sub("/","."); file_out = substr($0,2) ".fq"; next }
NR%4 == 2 { print > file_out; close(file_out) }
' file.fq