Tags: python, python-2.7, python-3.x, pattern-matching, difflib

find unique sentences in two files


I have two files and I am trying to print the sentences that are unique to each file. For this I am using difflib in Python.

import difflib

text = 'Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 = 'Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'

differ = difflib.Differ()
diff = differ.compare(text, text1)
print('\n'.join(diff))

It is not giving me the desired output. Instead, it prints one character per line, like this:

  P
  h
  y
  s
  i
  c
  s

  i
  s

  o
  n
  e

  o
  f

  t
  h
  e

My desired output is just the sentences that are unique to each file:

text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.

text1 = Quantum chemistry is a branch of chemistry.

Also, it seems that difflib.Differ works line by line, not sentence by sentence. How can I compare by sentences instead? Any suggestions, please?


Solution

  • First, you are right: Differ().compare() is meant to compare lines, not sentences.

    Second, it actually compares sequences, such as lists of strings. However, you pass two strings, not two lists of strings. Since a string is also a sequence (of characters), Differ().compare() in your case compares the individual characters.

    If you want to compare files by sentences, you must prepare two lists of sentences. You can use nltk.sent_tokenize(text) to split a string into sentences.

    import difflib
    import nltk  # sent_tokenize needs NLTK's Punkt tokenizer data to be installed

    differ = difflib.Differ()
    # Pass lists of sentences, not raw strings
    diff = differ.compare(nltk.sent_tokenize(text), nltk.sent_tokenize(text1))
    print('\n'.join(diff))
    #  Physics is one of the oldest academic disciplines.
    #- Perhaps the oldest through its inclusion of astronomy.
    #- Over the last two millennia.
    #  Physics was a part of natural philosophy along with chemistry.
    #+ Quantum chemistry is a branch of chemistry.
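
To get from the diff above to the output the question asks for, one option is to keep only the lines that Differ marks with "- " (present only in the first text) or "+ " (present only in the second). The sketch below is a rough illustration, not a definitive implementation. The helper names split_sentences and unique_sentences are hypothetical, and the regex splitter is a naive stand-in for nltk.sent_tokenize, used here only so the example runs with the standard library alone.

```python
import difflib
import re

text = ('Physics is one of the oldest academic disciplines. '
        'Perhaps the oldest through its inclusion of astronomy. '
        'Over the last two millennia. '
        'Physics was a part of natural philosophy along with chemistry.')
text1 = ('Physics is one of the oldest academic disciplines. '
         'Physics was a part of natural philosophy along with chemistry. '
         'Quantum chemistry is a branch of chemistry.')


def split_sentences(s):
    # Hypothetical helper: a naive replacement for nltk.sent_tokenize.
    # It splits on whitespace that follows '.', '!' or '?', so it will
    # mis-split abbreviations such as "e.g. this".
    return [part for part in re.split(r'(?<=[.!?])\s+', s.strip()) if part]


def unique_sentences(a, b):
    # Hypothetical helper: returns (sentences only in a, sentences only in b),
    # read off the two-character prefixes that Differ puts on each line.
    diff = list(difflib.Differ().compare(split_sentences(a), split_sentences(b)))
    only_a = [line[2:] for line in diff if line.startswith('- ')]
    only_b = [line[2:] for line in diff if line.startswith('+ ')]
    return only_a, only_b


only_text, only_text1 = unique_sentences(text, text1)
print('text = ' + ' '.join(only_text))
print('text1 = ' + ' '.join(only_text1))
# text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.
# text1 = Quantum chemistry is a branch of chemistry.
```

Note that Differ is order-sensitive: a sentence that appears in both texts but at a different position can be reported as both removed and added. If order does not matter, comparing the two sentence lists as sets may be a better fit.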