
GNU parallel used with xargs and awk


I have two large tab-separated files, A.tsv and B.tsv. They look like this (the header row is not actually in the files):

A.tsv:  
ID AGE  
User1  18   
...

B.tsv:  
ID INCOME  
User4  49000  
...

I want to select the IDs in A.tsv for which 10 <= AGE <= 20, then select the rows of B.tsv whose ID is in that list, and I want to use GNU parallel to do it. My attempt has two steps:

cat A.tsv | parallel --pipe -q awk '{ if ($3 >= 10 && $3 <= 20) print $1}' > list.tsv

cat list.tsv | parallel --pipe -q xargs -I% awk 'FNR==NR{a[$1];next}($1 in a)' % B.tsv > result.tsv

The first step works, but the second one fails with an error like:

awk: cannot open User1 (No such file or directory)

How can I fix this? Would this method still work if A.tsv and list.tsv were 2 to 3 times larger than the available memory?


Solution

    $ for I in $(seq 8 2 22); do echo -e "User$I\t$I" >> A.txt; done; cat A.txt
    User8   8
    User10  10
    User12  12
    User14  14
    User16  16
    User18  18
    User20  20
    User22  22
    
    $ for I in $(seq 8 2 22); do echo -e "User$I\t100${I}00" >> B.txt; done; cat B.txt
    User8   100800
    User10  1001000
    User12  1001200
    User14  1001400
    User16  1001600
    User18  1001800
    User20  1002000
    User22  1002200
    
    $ cat A.txt | parallel --pipe -q awk '{if ($2 >= 10 && $2 <= 20) print $1}' > list.txt
    $ cat B.txt | parallel --pipe -q grep -F -w -f list.txt
    User10  1001000
    User12  1001200
    User14  1001400
    User16  1001600
    User18  1001800
    User20  1002000
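
As for the original error: `xargs -I%` substitutes each input line (an ID such as User1) for `%`, so awk is handed the ID as a file name and cannot open it. If you would rather keep the awk lookup than switch to grep, one possible variant is to pass list.tsv as awk's first file and let `parallel --pipe` feed blocks of B.tsv on standard input (`-`). The following is only a sketch under some assumptions: the sample rows are made up, AGE is taken from column 2 as in the commands above, the `filter` variable and the `keep` array are names chosen for illustration, and the branch without GNU parallel exists only so the sketch runs on machines that lack it.

```shell
#!/bin/sh
# Hypothetical sample data (tab-separated, no header row).
printf 'User8\t8\nUser10\t10\nUser20\t20\nUser22\t22\n' > A.tsv
printf 'User8\t100800\nUser10\t1001000\nUser20\t1002000\nUser22\t1002200\n' > B.tsv

# Step 1: collect the IDs whose AGE (column 2) lies in [10, 20].
awk -F'\t' '$2 >= 10 && $2 <= 20 { print $1 }' A.tsv > list.tsv

# Step 2: the first file (list.tsv) fills the lookup array; every later
# record is printed only if its ID (column 1) is in that array.
filter='FNR == NR { keep[$1]; next } ($1 in keep)'

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --pipe splits stdin into blocks; each awk job reads list.tsv first,
    # then its block of B.tsv from stdin ("-"). -k keeps the input order.
    parallel -k --pipe -q awk -F'\t' "$filter" list.tsv - < B.tsv > result.tsv
else
    # Fallback without GNU parallel: the same filter in a single awk process.
    awk -F'\t' "$filter" list.tsv B.tsv > result.tsv
fi

cat result.tsv
```

Note that every awk job loads all IDs from list.tsv into its own array, so this assumes the ID list fits in memory once per concurrent job. B.tsv itself is only streamed block by block, so its size should matter much less.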