bash gnu-coreutils

How to find set difference of two files?


I have two files, A and B, and I want to find all the lines in A that are not in B. What's the fastest way to do this in bash using standard Linux utilities? Here's what I've tried so far:

while IFS= read -r line
 do
   # -x: match the whole line, -F: treat it as a fixed string, -q: no output
   if ! grep -qxF -- "$line" file2; then
   printf '%s\n' "$line"
   fi
 done < file1

It works, but it's slow. Is there a faster way of doing this?


Solution

  • The BashFAQ describes doing exactly this with comm, which is the canonically correct method.

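    # Subtraction of file1 from file2
    # (i.e., only the lines unique to file2)
    comm -13 <(sort file1) <(sort file2)

    # The reverse, as the question asks
    # (i.e., only the lines unique to file1)
    comm -23 <(sort file1) <(sort file2)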
    

    diff is less appropriate for this task, as it tries to operate on blocks rather than individual lines; as such, the algorithms it has to use are more complex and less memory-efficient.

    comm has been part of the Single Unix Specification since SUS2 (1997).
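
To make the column flags concrete, here is a small worked sketch. The sample data and the temporary directory are hypothetical, made up for illustration. It sorts into temporary files instead of using process substitution, so it should also run under a plain POSIX sh. Both inputs are sorted under the same locale (C here), since comm expects its inputs in the collation order it compares with.

```shell
# Hypothetical sample data; the directory and file names are illustrative only.
tmpdir=$(mktemp -d)
printf '%s\n' cherry apple banana date > "$tmpdir/file1"
printf '%s\n' date elderberry banana > "$tmpdir/file2"

# comm requires both inputs sorted in the same collation order,
# so sort and compare under the same locale.
export LC_ALL=C
sort "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort "$tmpdir/file2" > "$tmpdir/file2.sorted"

# -2 suppresses lines unique to file2, -3 suppresses lines common to both,
# leaving only the lines unique to file1.
only_in_file1=$(comm -23 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

# -1 suppresses lines unique to file1; what remains is unique to file2.
only_in_file2=$(comm -13 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")

printf 'only in file1:\n%s\n' "$only_in_file1"
printf 'only in file2:\n%s\n' "$only_in_file2"

rm -rf "$tmpdir"
```

With this sample data, the first list should contain apple and cherry, and the second only elderberry. Note that the output comes back in sorted order, not in the original order of the input file.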