
A development platform for a Unicode spell checker?


I have decided to develop a Unicode spell checker for a South Asian language as my final-year project. I want to build it as a plugin or a web service, but first I need to choose a suitable development platform. (It will not just look words up in a dictionary file; morphological analysis and generation modules (a stemmer) will also be used.)

Would JavaScript be able to handle such processing with a reasonable response time?

Will I be able to process a large dictionary on client side?

Are there any better suggestions you can make?


Solution

  • JavaScript is not up to the task, at least not by itself: its Unicode support is too primitive and, in many areas, missing altogether. For example, JavaScript has no support for Unicode grapheme clusters.
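
    To see why grapheme clusters matter for a spell checker, here is a minimal sketch in Python, used only as neutral notation. It groups each base code point with the combining marks that follow it. This is a rough approximation and not the full UAX #29 segmentation algorithm; the function name `approx_clusters` and the sample word (Hindi, chosen purely for illustration) are assumptions, not anything from the question.

```python
import unicodedata


def approx_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Rough approximation only, NOT full UAX #29: it ignores joiners,
    Hangul syllables, emoji sequences, and conjunct rules.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc, and Me are the nonspacing, spacing,
        # and enclosing combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The Devanagari word "Hindi", written with escapes so the code points are explicit.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"
code_points = len(word)
clusters = approx_clusters(word)
print(code_points, len(clusters))  # prints "6 3"
```

    A reader sees three characters where the string holds six code points, so any edit-distance or cursor logic that counts code points will disagree with what the user sees.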

    If you use Java, make sure you use the ICU libraries so that you get all the whizbang Unicode properties you’ll need for text segmentation. The place where Java’s native Unicode processing breaks down is its regex library, which is why Android JNIs over to the ICU C/C++ regex library. There are a lot of NLP tools written for Java, some of which you might find handy. Most of the ones I am aware of, though, are for English or at least Western languages.

    If you are willing to run part of your computation server-side via CGI instead of doing everything client-side, you are no longer bound by language choice. For example, you might combine JavaScript on the client with Perl on the server, whose Unicode support is even better than Java’s. How the two would fit together, and how to get the performance and behavior you want, depends on just what you actually want to do.
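
    The client/server split might look roughly like the sketch below: the browser posts a JSON list of words and the server answers with the ones it does not recognise. It is written in Python with only the standard library, standing in for whatever server language you pick. The names `KNOWN_WORDS`, `misspelled`, `app`, and `call_locally`, the JSON shape, and the one-word dictionary are all hypothetical.

```python
import json
import unicodedata
from io import BytesIO

# Hypothetical word list; a real service would load a full dictionary
# and consult a stemmer before giving up on a word.
KNOWN_WORDS = {"\u0915\u093f\u0924\u093e\u092c"}  # Hindi "kitab" (book)


def misspelled(words):
    """Return the words that are not in KNOWN_WORDS after NFC normalization."""
    return [w for w in words
            if unicodedata.normalize("NFC", w) not in KNOWN_WORDS]


def app(environ, start_response):
    """WSGI entry point; expects a JSON request body like {"words": [...]}."""
    length = int(environ.get("CONTENT_LENGTH") or 0)
    payload = json.loads(environ["wsgi.input"].read(length).decode("utf-8"))
    body = json.dumps({"misspelled": misspelled(payload["words"])}).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_locally(words):
    """Drive the app in-process, the way a WSGI server would, for a quick check."""
    raw = json.dumps({"words": words}).encode("utf-8")
    environ = {"REQUEST_METHOD": "POST",
               "CONTENT_LENGTH": str(len(raw)),
               "wsgi.input": BytesIO(raw)}
    status = []
    chunks = app(environ, lambda s, headers: status.append(s))
    return status[0], json.loads(b"".join(chunks).decode("utf-8"))
```

    To try it over HTTP, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the standard library should be enough for local experiments. The large dictionary then stays on the server and only the flagged words travel back to the client.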

    Perl also has quite a good number of industry-standard NLP modules widely available for it, most of which already know to use Unicode, since, like Java, Perl uses Unicode internally.

    A brief slide presentation on using NLP tools in Perl for certain sorts of morphological analysis, namely stemming and lemmatization, is available here. The presentation is known to work under Safari, Firefox, or Chrome, but not so well under Opera or Microsoft’s Internet Explorer.
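
    As a toy illustration of how stemming feeds a dictionary lookup, the Python sketch below strips one known suffix before checking the word list. Real stemmers and lemmatizers are far more elaborate. The dictionary entry, the suffix table, and the function names are assumptions made up for this example; Hindi is used only because the question does not name its language.

```python
import unicodedata

# Hypothetical data for illustration only.
DICTIONARY = {"\u0915\u093f\u0924\u093e\u092c"}  # Hindi "kitab" (book)
SUFFIXES = ["\u0947\u0902"]                      # plural ending, as in "kitaben" (books)


def stem_candidates(word):
    """Yield the word itself, then the word with each known suffix removed."""
    yield word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            yield word[: -len(suffix)]


def is_known(word):
    """Accept a word if it, or one of its suffix-stripped forms, is in the dictionary."""
    # Normalize first so that canonically equivalent spellings compare equal.
    word = unicodedata.normalize("NFC", word)
    return any(candidate in DICTIONARY for candidate in stem_candidates(word))
```

    With this in place, `is_known("\u0915\u093f\u0924\u093e\u092c\u0947\u0902")` accepts the plural form even though only the singular is stored, which is what keeps the dictionary file small for a morphologically rich language.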

    I am not aware of any tools specifically targeting Asian languages. Perl does, however, support UAX#11 (East Asian Width) and UAX#14 (Unicode Line Breaking) via the Unicode::LineBreak module from CPAN. It also comes with a fully compliant collation module implementing UTS#10, the Unicode Collation Algorithm, in the standard Unicode::Collate module, with locale support, including many Asian locales, available from the also-standard Unicode::Collate::Locale module. If you are working with CJK languages, you may want access to the Unihan database, available via the Unicode::Unihan module from CPAN. Even more fundamentally, Perl has native support for Unicode extended grapheme clusters by way of the \X metacharacter in its built-in regex engine, which neither Java nor JavaScript provides.

    All of this is the sort of thing you are likely to need, and to find terribly lacking, in JavaScript.