awk; getting multiple lines from two files when they share a common header


I have a question that is quite similar to many other questions on this topic, yet I am unable to extend those solutions to the exact output I am looking for.

I have two files that are formatted in fastq style, which looks like this:

file1.txt

@header:with:id:number:0001 1:this:number:indicates:pair:number
ABCD
+
1324
@header:with:id:number:0001 2:this:number:indicates:pair:number
EFGH
+
5678
@header:with:id:number:0002 2:this:number:indicates:pair:number
PQRS
+
9012
@header:with:id:number:0003 1:this:number:indicates:pair:number
IJKL
+
3456
@header:with:id:number:0003 2:this:number:indicates:pair:number
MNOP
+
7890

file2.txt

@header:with:id:number:0004 1:this:number:indicates:pair:number
QRST
+
1324
@header:with:id:number:0004 2:this:number:indicates:pair:number
UVWX
+
5678
@header:with:id:number:0005 1:this:number:indicates:pair:number
CDEF
+
3456
@header:with:id:number:0005 2:this:number:indicates:pair:number
GHIJ
+
7890
@header:with:id:number:0002 1:this:number:indicates:pair:number
YZAB
+
9012

Every 'block' has four lines, of which the first (the header) always starts with @ and includes an id-number (e.g. 0001) and an index (i.e. 1 or 2, after a space). Every id-number should occur twice in the same file, once with each index (this is true for all id-numbers except 0002 in the above example). Now I want to separately store the blocks whose id-number occurs in both files (that is, the blocks whose id-number occurs only once in either file).

In this case the output should be:

@header:with:id:number:0002 1:this:number:indicates:pair:number
YZAB
+
9012
@header:with:id:number:0002 2:this:number:indicates:pair:number
PQRS
+
9012

and these lines should be removed from the original files.

For this I have so far used awk with the following command:

awk -F" " '/^@/ && NR==FNR {lines[$1]; next}
    $1 in lines {x=NR+3}
    (NR<=x) {print $0}' file2.txt file1.txt

This outputs:

@header:with:id:number:0002 2:this:number:indicates:pair:number
PQRS
+
9012

which is halfway there.

My question is, how do I search for id-numbers in the headers that occur in both files, store them in a third file and remove the corresponding blocks from both original files?


Solution

  • Using GNU awk:

    awk 'BEGIN {
                 RS="@header"; # Set the input record separator (a multi-character RS needs GNU awk)
                 ORS=""        # Each record already ends with a newline, so add nothing on output
               }
       NF==0   { next } # Skip the empty record before the first "@header" of each file
               {
                 split($0,map,":"); # Split the record into array map using ":" as the delimiter
                 id=substr(map[5],1,4) # map[5] will be e.g. "0002 2". We only want 0002 and so use substr
               }
       FNR==NR { # process the first file
                 map1[id]=$0; # Store the record in array map1, indexed by id
                 next
               }
       id in map1 { # process the second file
                 print RS $0; # If id is in map1, print this record with its "@header" prefix restored
                 print RS map1[id] # and print the matching record stored from the first file
               }' file1.txt file2.txt


    One liner:

    awk 'BEGIN { RS="@header"; ORS="" } NF==0 { next } { split($0,map,":"); id=substr(map[5],1,4) } FNR==NR { map1[id]=$0; next } id in map1 { print RS $0; print RS map1[id] }' file1.txt file2.txt
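
  • To also remove the shared blocks from the originals:

    The command above prints the shared blocks but leaves file1.txt and file2.txt untouched. The following is only a sketch of one way to cover the rest of the question, not a tested pipeline. It assumes every block is exactly four lines, that both files are non-empty, and that the id is the last colon-separated piece of the first header field. The function name split_shared_blocks and the ".filtered" suffix are made up for illustration. It reads both files twice and uses plain awk features, so GNU awk should not be required.

```shell
# split_shared_blocks is a hypothetical helper name; so are the ".filtered" output names.
# Usage: split_shared_blocks file1.txt file2.txt shared.txt
split_shared_blocks() {
    # Both inputs are listed twice: pass 1 records the ids, pass 2 routes the blocks.
    awk -v shared="$3" '
        FNR == 1 { fileno++ }                 # 1,2 = first pass; 3,4 = second pass
        FNR % 4 == 1 {                        # header line of a four-line block
            n = split($1, part, ":")          # $1 is e.g. @header:with:id:number:0002
            id = part[n]                      # last colon-separated piece is the id
            if (fileno <= 2)
                seen[id, fileno] = 1          # remember which file the id came from
            else if ((id, 1) in seen && (id, 2) in seen)
                out = shared                  # id occurs in both files
            else
                out = FILENAME ".filtered"    # id occurs in this file only
        }
        fileno > 2 { print > out }            # second pass: write every line of the block
    ' "$1" "$2" "$1" "$2"
}
```

    Calling split_shared_blocks file1.txt file2.txt shared.txt would write the shared blocks to shared.txt (in input order, file1.txt first) and the remaining blocks to file1.txt.filtered and file2.txt.filtered. The originals are not overwritten; rename the filtered files once the result has been checked. Selecting header lines by position (every fourth line) rather than by a leading @ avoids misreading a quality line that happens to start with @.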