I want to read Bengali texts with NLTK's CategorizedPlaintextCorpusReader. Snapshot of my Bengali text file in the gedit text editor:
Snapshot of the same file in the Sublime Text editor:
From the snapshots you can see the problem. It is a Unicode composition problem (the dotted ring is a dead giveaway). Here is the code segment for reading the texts:
>>> import os
>>> path = os.path.expanduser('~/nltk_data/corpora/Bangla')
>>> from nltk.corpus.reader import CategorizedPlaintextCorpusReader
>>> from nltk import RegexpTokenizer
>>> word_tokenize = RegexpTokenizer(r"[\w']+")
>>> reader = CategorizedPlaintextCorpusReader(path, r'.*\.txt', cat_pattern=r'(.*)_.*', word_tokenizer=word_tokenize)
>>> reader.sents(categories='pos')
The output is:
The output should be 'একবার' rather than 'একব' 'র'. What can be done? Thanks in advance.
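\w does not match Bengali dependent vowel signs such as া (U+09BE), because they are combining marks rather than letters, so the tokenizer splits words at them. You need to provide the Unicode range for Bengali characters instead.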
Use
word_tokenize = RegexpTokenizer(u"[\u0980-\u09FF']+")
The apostrophe can remain in the character class as is.
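To see the difference without setting up a corpus, here is a minimal sketch that uses only the standard re module instead of NLTK. It assumes Python 3, and it assumes that calling re.findall directly behaves like the tokenizer for this one word (as far as I know, RegexpTokenizer applies its pattern with findall and Unicode matching enabled). The variable names are just for illustration, and the word is spelled with escapes so the code points are unambiguous:

```python
import re
import unicodedata

# "একবার" spelled out by code point: E, KA, BA, vowel sign AA, RA
word = "\u098F\u0995\u09AC\u09BE\u09B0"

# The vowel sign AA is a combining mark (category Mc), not a letter,
# so \w does not match it and the word is cut at that point.
aa_category = unicodedata.category("\u09BE")

naive_tokens = re.findall(r"[\w']+", word)

# The Bengali block U+0980-U+09FF includes the vowel signs,
# so the whole word is matched as one token.
block_tokens = re.findall("[\u0980-\u09FF']+", word)

print(aa_category)    # Mc
print(naive_tokens)   # ['একব', 'র']
print(block_tokens)   # ['একবার']
```

One thing to keep in mind: the block also contains the Bengali digits (U+09E6-U+09EF), so numbers written in Bengali digits come out as tokens too, while Latin words and ASCII digits are no longer matched at all.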