Tags: python, nlp, nltk, spacy, sentence-similarity

Why do NLTK (5400) and spaCy (5300) give different sentence counts?


I am new to NLP. I am using spaCy and NLTK to count the sentences in a JSON file, but there is a big difference between the two answers. I expected them to be the same. Can anyone explain why, or point me to a resource that covers this? I'm confused here.


Solution

  • Sentence segmentation and tokenization are NLP subtasks, and each library may implement them differently, leading to different error profiles.

    Even within the spaCy library there are different approaches: the best results are obtained by using the dependency parser, but a simpler rule-based sentencizer component also exists, which is faster but usually makes more mistakes (docs here).

    Because no implementation will be 100% perfect, you will get discrepancies between different methods and different libraries. What you can do is print the cases in which the methods disagree, inspect these manually, and get a feel for which of the approaches works best for your specific domain and type of texts.