python nlp text-processing

Reading Bengali with python Natural Language Toolkit


I want to read Bengali texts with NLTK's CategorizedPlaintextCorpusReader. Here is a snapshot of my Bengali text file in the gedit text editor:

[Screenshot: the text file in gedit]

Snapshot of the same file in the Sublime Text editor:

[Screenshot: the text file in Sublime Text]

The snapshots show the problem: it is a Unicode composition issue (the dotted ring is a dead giveaway). Here is the code segment that reads the texts:

>>> import os
>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> from nltk import RegexpTokenizer
>>> path = os.path.expanduser('~/nltk_data/corpora/Bangla')
>>> word_tokenize = RegexpTokenizer(r"[\w']+")
>>> reader = CategorizedPlaintextCorpusReader(path, r'.*\.txt', cat_pattern=r'(.*)_.*', word_tokenizer=word_tokenize)
>>> reader.sents(categories='pos')

The output is:

[Screenshot: the tokenized output]

The output should be 'একবার' rather than the two tokens 'একব' and 'র'. What can be done? Thanks in advance.


Solution

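  • You need to give the tokenizer the Unicode range for Bengali characters. In Python's re module, \w does not match Bengali combining vowel signs such as U+09BE (the "া" in 'একবার'), so [\w']+ breaks a word wherever one occurs.

    Use

    word_tokenize = RegexpTokenizer("[\u0980-\u09FF']+")
    

    The range U+0980 to U+09FF covers the Bengali Unicode block, vowel signs included. The apostrophe can remain in the character class as is. On Python 2, write the pattern as a unicode literal (u"...") so that the \u escapes are interpreted.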
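
    The difference between the two patterns can be checked without NLTK or the corpus. The following is a minimal sketch under two assumptions: Python 3, and that the tokenizer applies its pattern the way re.findall does here. The variable names are illustrative and not taken from the question.

    ```python
    import re
    import unicodedata

    # The word from the question, written with escapes so the code points are explicit:
    # U+098F, U+0995, U+09AC, U+09BE (vowel sign AA), U+09B0
    text = "\u098f\u0995\u09ac\u09be\u09b0"

    # U+09BE is a spacing combining mark (category "Mc"), not a letter,
    # which is why \w does not match it.
    vowel_sign_category = unicodedata.category("\u09be")

    # Pattern from the question: \w stops at the vowel sign.
    word_pattern = re.compile(r"[\w']+")
    # Pattern from the answer: the whole Bengali block plus the apostrophe.
    bengali_pattern = re.compile("[\u0980-\u09FF']+")

    split_tokens = word_pattern.findall(text)     # the word is cut in two
    whole_tokens = bengali_pattern.findall(text)  # the word stays intact

    print(vowel_sign_category)
    print(split_tokens)
    print(whole_tokens)
    ```

    If the first list has two items and the second has one, the range-based pattern is doing what the answer describes, and the same pattern string can be passed to RegexpTokenizer.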