Search code examples
linuxunixjoingnuouter-join

Full-outer-join for multiple files


Is there a good way to perform 'full outer join' multiple files on Unix ideally using GNU coreutils. My files were produced by uniq -c, below is a mock example the can be generated by:

echo "12 aa\n3 bb" > file1
echo "5 aa\n6 bb\n1 cc" > file2
echo "11 aa\n7 bb\n4 dd" > file3

and looks like:

tail -n +1 file*
==> file1 <==
12 aa
3 bb

==> file2 <==
5 aa
6 bb
1 cc

==> file3 <==
11 aa
7 bb
4 dd

I want to merge them one by one, using the sequence (column 2) as key filling in a 0 if the key is not included (outer join). I.e. the desired output would look like this

12 5 11 aa
3 6 6 bb
0 1 0 cc
0 0 4 dd

So far I found join to be doing the job at least for pairwise merging:

join -j2 file1 file2 -a 2 -a 2 -e '0' -o '1.1,2.1,0' > merged
# 12 5 aa
# 3 6 bb
# 0 1 cc

Note: j2 : looking at second column for key (for both files) -a FILENUM: also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

But I don't know how to generalize it for multiple files, i.e. this one doesn't work, which means I can't easily put it in a loop:

join -j2 merged file3 -a 2 -a 2 -e '0' -o '1.1,2.1,0' > merged2

I'd be ideally not want to use SQL to achive this, but would be ok if there wasn't any other way.


Solution

  • Finally figured out an efficient solution using sort packed in the following bash script multi_join_from_uniq.sh:

    #!/bin/sh
    
    join_rec() {
      file1=$1
      file2=$2
      shift 2
      if [ $# -gt 0 ]; then
        join -e0 -a 1 -a 2 "$file1" -o auto "$file2" | join_rec - "$@"
      else
        join -e0 -a 1 -a 2 "$file1" -o auto "$file2"
      fi
    }
    
    join_rec "$@"
    

    However, the script only works only for sorted and swapped columns, which can be done as process substitution to make this example work:

    sh multi_join_from_uniq.sh <(awk '{print $2,$1}' file1 | sort) <(awk '{print $2,$1}' file2 | sort ) <(awk '{print $2,$1}' file3 | sort )
    

    resulting in the following output:

    aa 12 5 11
    bb 3 6 7
    cc 0 1 0
    dd 0 0 4
    

    Of course, this can be also done when piping the the output of the uniq command e.g. this way cat data.txt | uniq -c | awk '{print $2,$1}' | sort > file1 or so.