My command:
awk 'NR==FNR{a[$0]=1;next;} substr($0,50,6) in a' file1 file2
The problem is that file2 contains \000 (NUL)
characters, so awk treats it as a binary file.
Replacing each \000
with a space character:
tr '\000' ' ' < file2 > file2_not_binary
solves the binary-file problem.
However, my file2 is a 20GB file, and I don't want to run tr
separately and save the result as another file. I want to pass the output of tr
directly to awk.
I have tried:
awk 'NR==FNR{a[$0]=1;next;} substr($0,50,6) in a' file1 < (tr '\000' ' ' < file2)
But the result is:
The system cannot find the file specified.
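That error most likely comes from the space in `< (…)`: in bash (and zsh), process substitution must be written `<(cmd)` with no space between `<` and `(`, and it is not available in plain sh or in Windows cmd. A minimal sketch of the corrected syntax, using tiny throwaway files whose names and contents are purely illustrative:

```shell
# Small sample files (names and data are illustrative)
printf 'KEY001\n' > file1
printf '%-49sKEY001rest\n%-49sZZZZZZ rest\n' x y > file2

# Correct syntax: no space between "<" and "(" in bash process substitution.
# awk sees the tr output as a regular (anonymous) file argument.
awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 <(tr '\000' ' ' < file2)
```

This should print only the first file2 line, whose characters 50-55 match the key in file1.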
Another question: can awk (and my memory) handle such a big file at once? I'm working on a PC with 12GB of RAM.
EDIT
One of the answers works as I expected (credit to Ed Morton):
tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -
However, it is about twice as slow as doing the same thing in two steps: first removing the \000
characters and saving the result, then using awk
to search. How can I speed it up?
EDIT2
My bad: Ed Morton's solution is actually slightly faster than running the same two commands separately.
Two commands separately: 08:37:053
Two commands piped: 08:07:204
Since awk isn't storing your 2nd file in memory, the size of that file is irrelevant except for execution speed. Try this:
tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -
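As a sanity check, the piped form can be exercised on tiny sample files (the data below is illustrative): awk reads file1 first to build the key set, and the trailing `-` tells it to read the tr output from stdin as its second file.

```shell
# Illustrative sample data: lookup keys in file1, fixed-width records in file2
printf 'ABC123\n' > file1
printf '%-49sABC123 match\n%-49sZZZ999 skip\n' '' '' > file2

# "-" makes awk read stdin (the NUL-stripped stream) as its second input file
tr '\000' ' ' < file2 | awk 'NR==FNR{a[$0];next} substr($0,50,6) in a' file1 -
```

Only the record whose characters 50-55 equal a key from file1 is printed; the second record is filtered out.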