
Trimming first characters of a line to match length of a second line


I am working with some FASTQ files, trimming a specific sequence from the 2nd line (the sequence line) of each record.

Input example:

@D00733:159:CA65UANXX:8:1214:11297:78554
GTTTTACACAATTATACGGACTTTATCCGCTTTTGTGCCTCTTTAATTTC
+
BBCCCEGGGGGGGFGEGGGDGGGGGGGGGGGGGGFGGGGGGGGGGGGEGG
@D00733:159:CA65UANXX:8:1214:11297:78555
TATGATTAGATGCGGATTGATCTGATCGGGACTGATTTTTTTTAGGGATT
+
BBCCCEGGGGGGGFGEGGGDGGGGGGGGGGGGGGFGGGGGGGGGGGGEGG

I remove the subsequence 'TTATACGGACTTTATC', and everything before it, from the sequence line with:

sed 's/^.*TTATACGGACTTTATC//' in.fastq > in2.fastq

The result looks like:

@D00733:159:CA65UANXX:8:1214:11297:78554
CGCTTTTGTGCCTCTTTAATTTC
+
BBCCCEGGGGGGGFGEGGGDGGGGGGGGGGGGGGFGGGGGGGGGGGGEGG
@D00733:159:CA65UANXX:8:1214:11297:78555
TATGATTAGATGCGGATTGATCTGATCGGGACTGATTTTTTTTAGGGATT
+
BBCCCEGGGGGGGFGEGGGDGGGGGGGGGGGGGGFGGGGGGGGGGGGEGG

What would be an efficient way to trim the beginning of the 4th line of each entry (the quality string) so that its length matches that of the 2nd line (the sequence)? Lines are delimited by \n characters, and each entry consists of 4 lines (identifier, sequence, +, quality).

Expected output:

@D00733:159:CA65UANXX:8:1214:11297:78554
CGCTTTTGTGCCTCTTTAATTTC
+
GGGGGGGFGGGGGGGGGGGGEGG
@D00733:159:CA65UANXX:8:1214:11297:78555
TATGATTAGATGCGGATTGATCTGATCGGGACTGATTTTTTTTAGGGATT
+
BBCCCEGGGGGGGFGEGGGDGGGGGGGGGGGGGGFGGGGGGGGGGGGEGG

Thanks in advance!


Solution

    $ awk 'NR%4==2{s=match($0,/TTATACGGACTTTATC/)+RLENGTH} NR%4~/[02]/{$0=substr($0,s)} 1' file
    @D00733:159:CA65UANXX:8:1214:11297:78554
    CGCTTTTGTGCCTCTTTAATTTC
    +
    GGGGGGGFGGGGGGGGGGGGEGG
    @D00733:159:CA65UANXX:8:1214:11297:78555
    TATGATTAGATGCGGATTGATCTGATCGGGACTGATTTTTTTTAGGGATT
    +
    BBCCCEGGGGGGGFGEGGGDGGGGGGGGGGGGGGFGGGGGGGGGGGGEGG
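
For readers who want to see what the one-liner is doing, here is a commented sketch of the same idea wrapped in a shell function. The function name trim_fastq and the variable names are hypothetical, not part of the original answer. It also differs from the one-liner in one small way: the no-match case is handled explicitly instead of relying on substr() accepting a start position below 1. Note that the command must run on the original, untrimmed file, because the sequence line still has to contain the subsequence for the offset to be computed. Also, match() finds the first occurrence of the subsequence, whereas the greedy sed pattern ^.* removes everything up to the last one, so the two can differ if the subsequence appears more than once in a read.

```shell
#!/bin/sh
# trim_fastq ADAPTER [FILE...]
# Hypothetical helper (name is an assumption): trims ADAPTER and everything
# before it from the sequence line, and cuts the quality line of the same
# record at the same offset so both keep equal length.
trim_fastq() {
    adapter=$1
    shift
    awk -v adapter="$adapter" '
        # Line 2 of each 4-line record: the sequence.
        NR % 4 == 2 {
            # match() sets RSTART and RLENGTH on success and returns 0 on
            # failure; with no match, keep the line from position 1.
            start = match($0, adapter) ? RSTART + RLENGTH : 1
        }
        # Lines 2 and 4 (sequence and quality): keep the text from start on.
        NR % 4 == 2 || NR % 4 == 0 {
            $0 = substr($0, start)
        }
        # Print every line, modified or not.
        { print }
    ' "$@"
}
```

With the function defined, a call such as trim_fastq TTATACGGACTTTATC in.fastq > trimmed.fastq would process the original file from the question. The adapter is passed as a dynamic regular expression, which is harmless for plain A/C/G/T strings but would need escaping if it contained regex metacharacters.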