Reading Bengali with Python's Natural Language Toolkit

I want to read Bengali texts with NLTK's CategorizedPlaintextCorpusReader. Snapshot of my Bengali text file in the gedit text editor:

Snapshot of the same file in the Sublime Text editor:

The snapshots show the problem. It looks like a Unicode composition problem (the dotted ring is a dead giveaway). Here is the code I use to read the texts:

>>> import os
>>> path = os.path.expanduser('~/nltk_data/corpora/Bangla')
>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> from nltk import RegexpTokenizer
>>> word_tokenize = RegexpTokenizer(r"[\w']+")
>>> reader = CategorizedPlaintextCorpusReader(path, r'.*\.txt', cat_pattern=r'(.*)_.*', word_tokenizer=word_tokenize)
>>> reader.sents(categories='pos')

The output is:

The output should be 'একবার' rather than 'একব' 'র'. What can be done? Thanks in advance.

Solution

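The word is split because \w in Python's re module matches letters, digits and the underscore, but not Bengali dependent vowel signs such as U+09BE (the AA sign in 'একবার'). Those are combining marks rather than letters, so the token ends where the vowel sign begins. You need to give the tokenizer the Unicode range for Bengali characters instead.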

Use

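word_tokenize = RegexpTokenizer(u"[\u0980-\u09FF']+")

The u prefix is needed on Python 2, where a plain string literal does not interpret \u escapes. It is harmless on Python 3.3 and later. The apostrophe can remain in the character class as is.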
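
To see why the two patterns behave differently, here is a minimal sketch that uses only the standard library, assuming Python 3. It applies each pattern with re.findall, which, as far as I know, is what RegexpTokenizer does when gaps is left at its default. The names WORD, WORD_CHARS, BENGALI_BLOCK and show_code_points are made up for this illustration.

```python
import re
import unicodedata

# 'একবার' written out by code point so the example does not depend on file encoding.
WORD = "\u098f\u0995\u09ac\u09be\u09b0"

# Pattern from the question: \w covers letters, digits and the underscore.
WORD_CHARS = re.compile(r"[\w']+")
# Pattern from the answer: every code point in the Bengali block, plus the apostrophe.
BENGALI_BLOCK = re.compile("[\u0980-\u09FF']+")


def show_code_points(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))
            for ch in text]


# The fourth character is BENGALI VOWEL SIGN AA, category Mc (a combining mark),
# which \w does not match, so the first pattern breaks the word there.
split_tokens = WORD_CHARS.findall(WORD)     # two tokens: 'একব' and 'র'
whole_tokens = BENGALI_BLOCK.findall(WORD)  # one token: the whole word

for row in show_code_points(WORD):
    print(row)
print(len(split_tokens), len(whole_tokens))
```

Note that the range U+0980 to U+09FF covers the whole Bengali block, so Bengali digits and a few symbols such as the rupee sign also count as token characters. The danda (U+0964) sits in the Devanagari block and is therefore left out.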