Search code examples
awkgnu-parallel

bash add/append new columns from other files


I have a name.txt file of one column, e.g.

A
B
C
D
E
F

Then I have many files, e.g. x.txt, y.txt and z.txt

x.txt has

A 1
C 3
D 2

y.txt has

A 1
B 4
E 3

z.txt has

B 2
D 2
F 1

The desirable output is (filling in 0 if there is no mapping)

A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1

Is it possible to make it with bash? (perhaps awk?)
Many thanks!!!


first edits - my tentative efforts
Since I am quite new to bash, it is really hard for me to figure out a possible solution with awk. I'm more familiar with R, in which this can be accomplished by

namematrix[namematrix[,1]==xmatrix[,1],]

All in all, I really appreciate the kind help below helping me learn more about awk and join!


Second-time edits - a super efficient approach figured out!

Luckily inspired by some really brilliant answers below, I have sorted out a very computationally efficient way as below. This may be helpful to other people encountering similar questions, in particular if they deal with very large number of files with very large size.

Firstly touch a join_awk.bash

#!/bin/bash
join -oauto -e0 -a1 $1 $2 | awk '{print $2}'

For instance, execute this bash script for name.txt and x.txt

join_awk.bash name.txt x.txt

would generate

1
0
3
2
0
0

Note that here I only keep the second column to save disk space, because in my dataset the first columns are very long names that would take tremendous disk space.

Then simply implement

parallel join_awk.bash name.txt {} \> outdir/output.{} ::: {a,b,c}.txt

This is inspired by the brilliant answer below using GNU parallel and join. The difference is that the answer below has to specify j1 for parallel due to its serial appending logic, which makes it not really "parallel". Also, the speed will become slower and slower as the serial appending continues. In contrast, here we manipulate each file separately in parallel. It can be extremely fast when we deal with large number of large-size files with multiple CPUs.

Finally just merge all the single-column output files together by

cd outdir
paste output* > merged.txt

This will be also very fast since paste is inherently parallel.


Solution

  • You may use this awk:

    awk 'NF == 2 {
       map[FILENAME,$1] = $2
       next
    }
    {
       printf "%s", $1
       for (f=1; f<ARGC-1; ++f)
          printf "%s", OFS map[ARGV[f],$1]+0
       print ""
    }' {x,y,z}.txt name.txt
    
    A 1 1 0
    B 0 4 2
    C 3 0 0
    D 2 0 2
    E 0 3 0
    F 0 0 1