Search code examples
awkfastq

How to extract sequence from a fastq file and save as new file for each sequence


I have a fastq file in which the first 8 lines look like this:

@SRR21388627.2845086/1
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF:FF
@SRR21388627.2707233/1
GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG
+
FFFF:FF,:FFFF,FF,F:FFFFF:,F,,:,FF:,:,FFF:::F:,,FFF:::,FF:::

I would like to take the DNA sequence and save each sequence as a new file named with the line before the sequence, such as "SRR21388627.2845086.1.fq", where the @ is removed, and the / is replaced with .

So far I came up with a command with reference from this post, which doesn't work yet, because I am not sure how to remove @ and replace / in awk

cat deltaQ_1_region_1.fq | paste - - - - | cut -f1,2 | 
awk -F'\t' '$1!=prev{close(out); out=$1".fq"; prev=$1} {sub(/[^\t]+\t/,""); print > out}' file

The expected output for SRR21388627.2845086.1.fq is:

GCTGCAGTTGCTGCTGTTGCTGCTGCTGGGGCAGCACACCAGGATGGCCGGCGCCCCCG

Solution

  • You have 4 lines per "record" so you can use a modulo 4 to differentiate the "fields" in awk (instead of joining them with a paste before calling awk):

    awk '
        NR%4 == 1 { sub("/","."); file_out = substr($0,2) ".fq"; next }
        NR%4 == 2 { print > file_out; close(file_out) }
    ' file.fq