I have two files and I am trying to print the sentences that are unique to each file. For this I am using difflib in Python.
text ='Physics is one of the oldest academic disciplines. Perhaps the oldest through its inclusion of astronomy. Over the last two millennia. Physics was a part of natural philosophy along with chemistry.'
text1 ='Physics is one of the oldest academic disciplines. Physics was a part of natural philosophy along with chemistry. Quantum chemistry is a branch of chemistry.'
import difflib
differ = difflib.Differ()
diff = differ.compare(text,text1)
print('\n'.join(diff))
But it's not giving me the desired output. It gives me this instead:
P
h
y
s
i
c
s
i
s
o
n
e
o
f
t
h
e
My desired output is just the sentences that are unique to each file:
text = Perhaps the oldest through its inclusion of astronomy. Over the last two millennia.
text1 = Quantum chemistry is a branch of chemistry.
Also, it seems that difflib.Differ goes line by line, not sentence by sentence. Any suggestions, please? How can I do that?
First, indeed, Differ().compare() compares lines, not sentences.
Second, it actually compares sequences, such as lists of strings. However, you pass two strings, not two lists of strings. Since a string is also a sequence (of characters), Differ().compare() in your case compares the individual characters.
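To make the difference concrete, here is a minimal sketch with two toy inputs (the strings "ab" and "ac" are made up for illustration and are not from your files). It passes the same data once as plain strings and once as one-element lists:

```python
import difflib

differ = difflib.Differ()

# Two plain strings: every character is treated as a separate item.
char_diff = list(differ.compare("ab", "ac"))

# Two lists of strings: every list element is treated as a separate item.
item_diff = list(differ.compare(["ab"], ["ac"]))

print(char_diff)  # ['  a', '- b', '+ c']
print(item_diff)  # ['- ab', '+ ac']
```

The first call reports a diff per character, which is what happened in your output; the second reports a diff per list element.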
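If you want to compare files by sentences, you must prepare two lists of sentences. You can use nltk.sent_tokenize(text) to split a string into sentences. Note that NLTK is a third-party package, and sent_tokenize needs its Punkt tokenizer data to be downloaded before first use.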
import nltk

# differ, text and text1 are the objects defined in the question
diff = differ.compare(nltk.sent_tokenize(text), nltk.sent_tokenize(text1))
print('\n'.join(diff))
# Physics is one of the oldest academic disciplines.
#- Perhaps the oldest through its inclusion of astronomy.
#- Over the last two millennia.
# Physics was a part of natural philosophy along with chemistry.
#+ Quantum chemistry is a branch of chemistry.
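If you only want the unique sentences and not the full diff, you can filter the result by its two-character prefix. The sketch below is one possible way to do it, not the only one. To stay self-contained it swaps nltk.sent_tokenize for a naive regex splitter; split_sentences is a hypothetical helper, and it assumes that a sentence ends with '.', '!' or '?' followed by whitespace, which will break on abbreviations such as "e.g.".

```python
import difflib
import re

text = ('Physics is one of the oldest academic disciplines. '
        'Perhaps the oldest through its inclusion of astronomy. '
        'Over the last two millennia. '
        'Physics was a part of natural philosophy along with chemistry.')
text1 = ('Physics is one of the oldest academic disciplines. '
         'Physics was a part of natural philosophy along with chemistry. '
         'Quantum chemistry is a branch of chemistry.')


def split_sentences(s):
    # Hypothetical stand-in for nltk.sent_tokenize: split after
    # '.', '!' or '?' when it is followed by whitespace.
    return [part for part in re.split(r'(?<=[.!?])\s+', s.strip()) if part]


diff = list(difflib.Differ().compare(split_sentences(text),
                                     split_sentences(text1)))

# '- ' marks items found only in the first sequence,
# '+ ' marks items found only in the second one.
only_in_text = [line[2:] for line in diff if line.startswith('- ')]
only_in_text1 = [line[2:] for line in diff if line.startswith('+ ')]

print('text =', ' '.join(only_in_text))
print('text1 =', ' '.join(only_in_text1))
```

Filtering on the prefix also skips the '? ' hint lines that Differ may add when two sentences are similar but not identical.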