
bash: Difference between join and comm


# comm -12 /tmp/src /tmp/txt | wc -l
  10338
# join /tmp/src /tmp/txt | wc -l
  10355

Both files are single columns of alphanumeric strings, and both are sorted. Shouldn't the two counts be the same?


Update, following @Kevin's answer below:

cat /tmp/txt | sed 's/^[:space:]*//' > /tmp/stxt
cat /tmp/src | sed 's/^[:space:]*//' > /tmp/ssrc

and the result:

# join /tmp/ssrc /tmp/stxt | wc -l
516
# comm -12 /tmp/ssrc /tmp/stxt | wc -l
513

On manual inspection of the diffs, the results differ because of some whitespace that the sed command did not remove.
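
A likely reason some whitespace survived: `[:space:]` is only a character class when it appears inside a bracket expression, i.e. `[[:space:]]`. Written as `[:space:]`, it is either rejected by sed or treated as the set of literal characters `:`, `s`, `p`, `a`, `c`, `e`, depending on the implementation. The sketch below strips leading and trailing whitespace and then re-sorts; the `/tmp/demo_*` file names and their contents are made up for illustration and stand in for `/tmp/src`.

```shell
# Hypothetical sample input: leading spaces, a trailing space, a leading tab.
printf '  abc\nspace1 \n\tzed\n' > /tmp/demo_src

# [[:space:]] is the POSIX character class inside a bracket expression.
# Trim both ends of every line, then re-sort: trimming can change the
# order, and both join and comm expect sorted input.
sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' /tmp/demo_src \
  | LC_ALL=C sort > /tmp/demo_ssrc

cat /tmp/demo_ssrc
```

Trailing whitespace is stripped as well because comm compares whole lines, so a single trailing space is enough to make two otherwise identical lines differ.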


Solution

  • I haven't used either extensively, but from a quick look at the man pages and some test input, it seems that comm prints the lines unique to each file as well as the common ones (in three columns), while join prints only the lines whose join field matches. The -12 suppresses the first two columns, so that difference is already taken care of. You could store the output of the two commands in files and diff them to see how they differ.

    $ echo -e '1\n2\n3\n5' > a
    $ echo -e '1\n2\n4\n5' > b
    $ comm a b
                    1
                    2
    3
            4
                    5
    $ join a b
    1
    2
    5
    $
    

    Edit: join compares only the first whitespace-separated field of each line by default, whereas comm compares the whole line. Any stray whitespace on a line (a trailing space, or a second field after it) can therefore make the two outputs differ.
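
    A minimal sketch of that difference, using made-up two-field input (the file names and contents are hypothetical, not taken from the question). The line with key k2 matches for join but not for comm, and diffing the saved outputs shows exactly which line accounts for the extra count:

```shell
# Work in a scratch directory; all names here are illustrative only.
dir=$(mktemp -d)
printf 'k1\nk2 x\nk3\n' > "$dir/a"
printf 'k1\nk2 y\nk4\n' > "$dir/b"

# comm -12 keeps only lines that are identical in both files.
LC_ALL=C comm -12 "$dir/a" "$dir/b" > "$dir/comm.out"

# join matches on the first whitespace-separated field only.
LC_ALL=C join "$dir/a" "$dir/b" > "$dir/join.out"

wc -l < "$dir/comm.out"   # one common line: k1
wc -l < "$dir/join.out"   # two joined lines: k1 and "k2 x y"

# diff exits non-zero when the files differ, which is expected here.
diff "$dir/comm.out" "$dir/join.out" || true
```

    join also emits one output line for every pairing of lines that share a key, so repeated keys in either file are another possible source of a higher join count.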