linux awk sed grep fastq

How to extract a specific text from gz file?


I need to extract characters 5 to 11 from each read in my fastq.gz data. The file is too large to process in R, so I was wondering whether I can do this directly on the Linux command line. The fastq file looks like this:

@NB501399:67:HFKTCBGX5:1:11101:13202:1044 1:N:0:CTTGTA
GAGGTNACGGAGTGGGTGTGTGCAGGGCCTGGTGGGAATGGGGAGACCCGTGGACAGAGCTTGTTAGAGTGTCCTAGAGCCAGGGGGAACTCCAGGCAGGGCAAATTGGGCCCTGGATGTTGAGAAGCTGGGTAACAAGTACTGAGAGAAC
+
    AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE6

@NB501399:67:HFKTCBGX5:1:11101:1109:1044 1:N:0:CTTGTA
TAGGCNACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGACTCAAGCGATCCTCCAGCCTCAGCCTCCCGAGTAGCTGGGACTACAG
+

I only want to extract characters 5 to 11 of the sequence line (TNACGG for the first record, CNACCT for the second, that is, the six characters starting at position 5) and write them to a new txt file. Can I do that?


Solution

  • Another approach, using zgrep with a positive lookbehind:

    $ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
    TNACGG
    CNACCT
    

    Explained:

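    • zgrep: searches possibly compressed files for a regular expression (see man zgrep)
    • -o: print only the matched (non-empty) parts of a matching line
    • -P: interpret the pattern as a Perl-compatible regular expression (PCRE); this is a GNU grep option
    • (?<=^[ACTGN]{4}): positive lookbehind; requires the match to be preceded by exactly four of the characters A, C, T, G, N at the start of the line, without including them in the output
    • [ACTGN]{6}: match the six characters from that set that follow the lookbehind
    • foo.gz: my test file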
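
    One caveat with a pure regex match: a quality line that happens to begin with ten characters from A, C, G, T, N would also match, since those letters are valid quality symbols. A minimal sketch of an alternative that selects lines by position instead, assuming every record is exactly four lines with no blank lines in between (the blank line in the question's sample would break this). The file names sample.fastq.gz and extracted.txt are made up for illustration, and the redirect at the end covers the "new txt file" part of the question:

```shell
# Build a tiny gzipped FASTQ file for demonstration (hypothetical file name).
# The second record's quality line consists only of 'A', which a regex on
# [ACTGN] alone could mistake for sequence.
printf '%s\n' \
  '@read1' 'GAGGTNACGGAGTGGG' '+' 'AAAAA#EEEEEEEEEE' \
  '@read2' 'TAGGCNACCTGGTGGT' '+' 'AAAAAAAAAAAAAAAA' \
  | gzip -c > sample.fastq.gz

# Stream the decompressed data, keep only line 2 of each 4-line record
# (the sequence), and print the 6 characters starting at position 5.
gzip -dc sample.fastq.gz \
  | awk 'NR % 4 == 2 { print substr($0, 5, 6) }' > extracted.txt

cat extracted.txt
```

    Because the data is streamed through a pipe, the decompressed file never has to fit in memory or be written to disk, which should help with very large inputs.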