I have a name.txt file of one column, e.g.
A
B
C
D
E
F
Then I have many files, e.g. x.txt, y.txt and z.txt
x.txt has
A 1
C 3
D 2
y.txt has
A 1
B 4
E 3
z.txt has
B 2
D 2
F 1
The desirable output is (filling in 0 if there is no mapping)
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1
Is it possible to make it with bash? (perhaps awk?)
Many thanks!!!
first edits - my tentative efforts
Since I am quite new to bash, it is really hard for me to figure out a possible solution with awk. I'm more familiar with R, in which this can be accomplished by
namematrix[namematrix[,1]==xmatrix[,1],]
All in all, I really appreciate the kind help below helping me learn more about awk
and join
!
Second-time edits - a super efficient approach figured out!
Luckily inspired by some really brilliant answers below, I have sorted out a very computationally efficient way as below. This may be helpful to other people encountering similar questions, in particular if they deal with very large number of files with very large size.
Firstly touch a join_awk.bash
#!/bin/bash
join -oauto -e0 -a1 $1 $2 | awk '{print $2}'
For instance, execute this bash script for name.txt and x.txt
join_awk.bash name.txt x.txt
would generate
1
0
3
2
0
0
Note that here I only keep the second column to save disk space, because in my dataset the first columns are very long names that would take tremendous disk space.
Then simply implement
parallel join_awk.bash name.txt {} \> outdir/output.{} ::: {a,b,c}.txt
This is inspired by the brilliant answer below using GNU parallel and join. The difference is that the answer below has to specify j1
for parallel
due to its serial appending logic, which makes it not really "parallel". Also, the speed will become slower and slower as the serial appending continues. In contrast, here we manipulate each file separately in parallel. It can be extremely fast when we deal with large number of large-size files with multiple CPUs.
Finally just merge all the single-column output files together by
cd outdir
paste output* > merged.txt
This will be also very fast since paste
is inherently parallel.
You may use this awk
:
awk 'NF == 2 {
map[FILENAME,$1] = $2
next
}
{
printf "%s", $1
for (f=1; f<ARGC-1; ++f)
printf "%s", OFS map[ARGV[f],$1]+0
print ""
}' {x,y,z}.txt name.txt
A 1 1 0
B 0 4 2
C 3 0 0
D 2 0 2
E 0 3 0
F 0 0 1