Is there a good way to perform 'full outer join' multiple files on Unix ideally using GNU coreutils.
My files were produced by uniq -c
, below is a mock example the can be generated by:
echo "12 aa\n3 bb" > file1
echo "5 aa\n6 bb\n1 cc" > file2
echo "11 aa\n7 bb\n4 dd" > file3
and looks like:
tail -n +1 file*
==> file1 <==
12 aa
3 bb
==> file2 <==
5 aa
6 bb
1 cc
==> file3 <==
11 aa
7 bb
4 dd
I want to merge them one by one, using the sequence (column 2) as key filling in a 0 if the key is not included (outer join). I.e. the desired output would look like this
12 5 11 aa
3 6 6 bb
0 1 0 cc
0 0 4 dd
So far I found join
to be doing the job at least for pairwise merging:
join -j2 file1 file2 -a 2 -a 2 -e '0' -o '1.1,2.1,0' > merged
# 12 5 aa
# 3 6 bb
# 0 1 cc
Note: j2 : looking at second column for key (for both files) -a FILENUM: also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2
But I don't know how to generalize it for multiple files, i.e. this one doesn't work, which means I can't easily put it in a loop:
join -j2 merged file3 -a 2 -a 2 -e '0' -o '1.1,2.1,0' > merged2
I'd be ideally not want to use SQL to achive this, but would be ok if there wasn't any other way.
Finally figured out an efficient solution using sort
packed in the following bash script multi_join_from_uniq.sh
:
#!/bin/sh
join_rec() {
file1=$1
file2=$2
shift 2
if [ $# -gt 0 ]; then
join -e0 -a 1 -a 2 "$file1" -o auto "$file2" | join_rec - "$@"
else
join -e0 -a 1 -a 2 "$file1" -o auto "$file2"
fi
}
join_rec "$@"
However, the script only works only for sorted and swapped columns, which can be done as process substitution to make this example work:
sh multi_join_from_uniq.sh <(awk '{print $2,$1}' file1 | sort) <(awk '{print $2,$1}' file2 | sort ) <(awk '{print $2,$1}' file3 | sort )
resulting in the following output:
aa 12 5 11
bb 3 6 7
cc 0 1 0
dd 0 0 4
Of course, this can be also done when piping the the output of the uniq command e.g. this way cat data.txt | uniq -c | awk '{print $2,$1}' | sort > file1
or so.