I want to use the NLTK chunker for Tamil (an Indic language). However, its documentation says that it doesn't support Unicode because it uses the 'pre' module for regular expressions:
Unresolved Issues

If we use the re module for regular expressions, Python's regular expression engine generates "maximum recursion depth exceeded" errors when processing very large texts, even for regular expressions that should not require any recursion. We therefore use the pre module instead. But note that pre does not include Unicode support, so this module will not work with unicode strings.
Any suggestions for a workaround or another way to accomplish this?
Chunkers are language-specific, so you need to train one for Tamil anyway. Of course, if you are happy with available off-the-shelf solutions (I've got no idea if there are any, e.g. whether the link in the now-deleted answer is any good), you can stop reading here. If not, you can train your own, but you'll need a corpus that is annotated with the chunks you want to recognize: perhaps you are after NP chunks (the usual case), but maybe it's something else.
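If you want to see what a chunk-annotated corpus looks like in NLTK terms, the English CoNLL-2000 corpus that ships with the NLTK data packages is a convenient reference point; your Tamil corpus will need the same kind of chunk annotation, whatever file format you actually store it in:

```python
import nltk
from nltk.corpus import conll2000

# English CoNLL-2000 chunking corpus, used here purely to illustrate the
# data format; it requires the corpus download: nltk.download('conll2000')
train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])

# Each sentence is an nltk.Tree whose NP chunks are subtrees over
# (word, POS-tag) leaves -- this tree format is what the training
# code further down expects.
print(train_sents[0])
```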
Once you have an annotated corpus, look carefully at chapters 6 and 7 of the NLTK book, and especially section 7.3, Developing and evaluating chunkers. While Chapter 7 begins with NLTK's regexp chunker, keep reading and you'll see how to build a "sequence classifier" that does not rely on NLTK's regexp-based chunking engine. (Chapter 6 is essential for this, so don't skip it.)
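To make that concrete, here is a stripped-down sketch in the spirit of the book's ConsecutiveNPChunker: a greedy sequence classifier that assigns an IOB chunk tag to each word in turn. The names (ClassifierChunker, chunk_features) and the tiny feature set are mine, not the book's, and they are placeholders rather than a recommendation for Tamil:

```python
import nltk

def chunk_features(sentence, i, history):
    # Deliberately minimal placeholder features: the current word, its POS
    # tag, the previous POS tag, and the previously assigned IOB tag.
    # You will need to experiment to find cues that work for Tamil
    # (case markers, suffixes, and so on).
    word, pos = sentence[i]
    prevpos = sentence[i - 1][1] if i > 0 else "<START>"
    previob = history[-1] if history else "<START>"
    return {"word": word, "pos": pos, "prevpos": prevpos, "previob": previob}

class ClassifierChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # train_sents: chunk trees (nltk.Tree) from your annotated corpus.
        train_data = []
        for sent in train_sents:
            tagged = nltk.chunk.tree2conlltags(sent)       # (word, pos, iob) triples
            untagged = [(w, p) for w, p, _ in tagged]
            history = []
            for i, (_, _, iob) in enumerate(tagged):
                train_data.append((chunk_features(untagged, i, history), iob))
                history.append(iob)
        self.classifier = nltk.NaiveBayesClassifier.train(train_data)

    def parse(self, sentence):
        # sentence: a list of (word, POS) pairs; returns a chunk tree.
        history = []
        for i in range(len(sentence)):
            history.append(self.classifier.classify(chunk_features(sentence, i, history)))
        conlltags = [(w, p, iob) for (w, p), iob in zip(sentence, history)]
        return nltk.chunk.conlltags2tree(conlltags)
```

Train it with chunker = ClassifierChunker(train_sents) and score it against held-out trees with chunker.evaluate(test_sents), as in section 7.3. The book's version is built around a Maxent classifier and a richer feature set, but the overall structure is the same.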
It's not a trivial task: you need to understand the classifier approach, put the pieces together, probably convert your corpus to IOB format, and finally select features that give you satisfactory performance. But it is pretty straightforward, and can be carried out for any language or chunking task for which you have an annotated corpus. The only open-ended part is thinking up contextual cues that you can convert into features to help the classifier decide correctly, and experimenting until you find a good mix. (On the up side, it is a much more powerful approach than pure regexp-based solutions, even for ASCII text.)
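For the IOB conversion itself, nltk.chunk already provides both directions, so you only have to get your corpus into either chunk trees or (word, POS, IOB) triples. A hand-built toy example (the Tamil words and tags here are invented purely for illustration):

```python
from nltk import Tree
from nltk.chunk import tree2conlltags, conlltags2tree

# Toy annotated sentence as a chunk tree with (word, POS) leaves;
# in practice you would build these trees from your own corpus files.
sent = Tree("S", [
    Tree("NP", [("அவன்", "PRP")]),
    Tree("NP", [("புத்தகம்", "NN")]),
    ("படித்தான்", "VB"),
])

iob = tree2conlltags(sent)
print(iob)
# [('அவன்', 'PRP', 'B-NP'), ('புத்தகம்', 'NN', 'B-NP'), ('படித்தான்', 'VB', 'O')]

# The conversion round-trips, so you can move freely between the tree
# form (what evaluate() expects) and the IOB triples the classifier uses.
print(conlltags2tree(iob))
```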