I have a question that is quite similar to many other questions on this topic, yet I am unable to extend those solutions to the exact output I am looking for.
I have two files that are formatted in fastq style, which looks like this:
file1.txt
@header:with:id:number:0001 1:this:number:indicates:pair:number
ABCD
+
1324
@header:with:id:number:0001 2:this:number:indicates:pair:number
EFGH
+
5678
@header:with:id:number:0002 2:this:number:indicates:pair:number
PQRS
+
9012
@header:with:id:number:0003 1:this:number:indicates:pair:number
IJKL
+
3456
@header:with:id:number:0003 2:this:number:indicates:pair:number
MNOP
+
7890
file2.txt
@header:with:id:number:0004 1:this:number:indicates:pair:number
QRST
+
1324
@header:with:id:number:0004 2:this:number:indicates:pair:number
UVWX
+
5678
@header:with:id:number:0005 1:this:number:indicates:pair:number
CDEF
+
3456
@header:with:id:number:0005 2:this:number:indicates:pair:number
GHIJ
+
7890
@header:with:id:number:0002 1:this:number:indicates:pair:number
YZAB
+
9012
Every 'block' has four lines, of which the first (the header) always starts with @ and includes an id-number (e.g. 0001) and an index (i.e. 1 or 2, after a space). Every id-number should occur twice in the same file, once with each index (this holds for all id-numbers except 0002 in the example above). Now I want to separately store the blocks whose id-number occurs in both files (that is, the blocks whose id-number occurs only once in either file).
In this case the output should be:
@header:with:id:number:0002 1:this:number:indicates:pair:number
YZAB
+
9012
@header:with:id:number:0002 2:this:number:indicates:pair:number
PQRS
+
9012
and these lines should be removed from the original files.
For this I have so far used awk with the following command:
awk -F" " '/^@/ && NR==FNR {lines[$1]; next}
$1 in lines {x=NR+3}
(NR<=x) {print $0}' file2.txt file1.txt
This outputs:
@header:with:id:number:0002 2:this:number:indicates:pair:number
PQRS
+
9012
which is halfway there.
My question is, how do I search for id-numbers in the headers that occur in both files, store them in a third file and remove the corresponding blocks from both original files?
Using GNU awk (a multi-character RS is a gawk extension), this prints the blocks whose id occurs in both files:
awk 'BEGIN {
RS="@header"; # Set the input record separator
ORS="" # Records already end in a newline, so add nothing on output
}
$0=="" { next } # Skip the empty record before the first @header of each file
{
split($0,map,":"); # Split the record into array map using ":" as the delimiter
id=substr(map[5],1,4) # map[5] will be e.g 0002 2. We only want 0002 and so use substr
}
FNR==NR { # process the first file
map1[id]=$0; # Index array map1 by id with the record as the value
next
}
id in map1 { # process the second file: id was also seen in the first file
print RS $0; # RS was stripped from the record, so put @header back in front
print RS map1[id] # print the matching record from the first file
}' file1.txt file2.txt
One-liner:
awk 'BEGIN { RS="@header"; ORS="" } $0=="" { next } { split($0,map,":"); id=substr(map[5],1,4) } FNR==NR { map1[id]=$0; next } id in map1 { print RS $0; print RS map1[id] }' file1.txt file2.txt
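The command above only prints the shared blocks; it does not take them out of the original files. A minimal sketch of one way to do both follows, assuming every block is exactly four lines and the id is the last ":"-separated piece of the first word of the header. The function name split_shared and the output names shared.txt and *.unique are made up for illustration. It reads each file twice (once to collect ids, once to write) and uses only POSIX awk features. Headers are found by line position rather than by a leading @, because a quality line can also start with @.

```shell
# Sketch: split_shared FILE1 FILE2
# Writes shared.txt, FILE1.unique and FILE2.unique (hypothetical output names).
split_shared() {
    # Each input is listed twice: pass 1 collects ids, pass 2 writes the blocks.
    awk '
    FNR == 1 { fileno++ }               # count files as they are opened (assumes non-empty files)
    FNR % 4 == 1 {                      # header line of each four-line block
        n = split($1, parts, ":")       # $1 is the header up to the first space
        id = parts[n]                   # last ":"-separated piece, e.g. 0002
        if (fileno == 1) in1[id] = 1    # pass 1: ids present in the first file
        if (fileno == 2) in2[id] = 1    # pass 1: ids present in the second file
        # destination for this whole block (only used during pass 2)
        out = ((id in in1) && (id in in2)) ? "shared.txt" : FILENAME ".unique"
    }
    fileno > 2 { print > out }          # pass 2: write every line of the block
    ' "$1" "$2" "$1" "$2"
}
```

Calling split_shared file1.txt file2.txt leaves the inputs untouched. Once the results look right, the .unique files can be moved over the originals with mv. Blocks in shared.txt appear in input order (first file, then second), so sort them afterwards if the index 1 block has to come first.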