
Basic NLP in CoffeeScript or JavaScript -- Punkt tokenization, simple trained Bayes models -- where to start?


My current web-app project calls for a little NLP:

  • Tokenizing text into sentences, via Punkt and similar;
  • Breaking down the longer sentences by subordinate clause (often it's on commas, except when it isn't)
  • A Bayesian model fit for chunking paragraphs with an even feel: no orphans or widows, and minimal awkward splits (maybe)
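For a sense of what a purely client-side split looks like, here is a naive regex sketch in JavaScript. This is emphatically not Punkt (Punkt is a trained model that learns abbreviations and collocations from a corpus); this sketch will happily break on "Dr. Smith":

```javascript
// Naive sentence splitter -- a stand-in sketch, NOT Punkt.
// Splits on terminal punctuation followed by whitespace and a capital
// letter, so abbreviations like "Dr." will be split incorrectly.
function naiveSentences(text) {
  return text
    .split(/(?<=[.!?])\s+(?=[A-Z])/)
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; });
}

// naiveSentences("It works. Mostly! Does it?")
//   -> ["It works.", "Mostly!", "Does it?"]
```

The gap between this and Punkt (abbreviation lists, learned collocations, ellipsis handling) is exactly the data-heavy part that is awkward to ship to the browser.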

... and much of that is a childishly easy task if you've got NLTK, which I do, sort of: the app backend is Django on Tornado; you'd think doing these things would be a non-issue.

However, I've got to provide interactive user feedback that requires these tokenizers, so I need to tokenize the data client-side.

Right now I actually am using NLTK, via a REST API call to a Tornado process that wraps the NLTK function and little else. At the moment, things like latency and concurrency are obviously suboptimal w/r/t this ad-hoc service, to put it politely. What I should be doing, I think, is getting my hands on CoffeeScript/JavaScript versions of these functions, if not reimplementing them myself.
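One cheap way to blunt the latency of that ad-hoc service, whatever transport ends up underneath, is to memoize identical requests client-side so repeated text never goes over the wire twice. A sketch; the `send` transport and the `/tokenize` endpoint in the usage comment are assumptions, not the actual service:

```javascript
// Memoizing wrapper around a remote tokenize call.
// `send` is any function text -> Promise of a result; identical
// inputs reuse the cached Promise instead of hitting the server.
function makeTokenizer(send) {
  var cache = Object.create(null);   // text -> Promise of result
  return function (text) {
    if (!(text in cache)) {
      cache[text] = send(text);
    }
    return cache[text];
  };
}

// Usage sketch (the "/tokenize" endpoint and JSON shape are assumed):
// var tokenize = makeTokenizer(function (text) {
//   return fetch("/tokenize", {
//     method: "POST",
//     headers: { "Content-Type": "application/json" },
//     body: JSON.stringify({ text: text })
//   }).then(function (r) { return r.json(); });
// });
```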

And but so then, from what I've seen, JavaScript hasn't been considered cool long enough to have accumulated the not-just-web-specific, general-purpose library smorgasbord one can find in C or Python (or even Erlang). NLTK of course is a standout project by anyone's measure, but I only need a few percent of what it is packing.

But so now I am at a crossroads — I have to double down on either:

  • The “learning scientific JavaScript technique fit for reimplementing algorithms I am Facebook friends with at best” plan, or:
  • The less interesting but more deterministically doable “settle for tokenizing over the wire, but overcompensate for the dearth of speed and programming interestingness — ensure a beachball-free UX by elevating a function call into a robustly performant paragon of web-scale service architecture, making Facebook look like Google+” option.

Or something else entirely. What should I do, just to start things off? That is my question. I'm open to solutions involving an atypical approach; as long as your recommendation is not distasteful (e.g. "use Silverlight") and/or a time vortex (e.g. "get a computational linguistics PhD, you troglodyte"), I am game. Thank you in advance.


Solution

  • I think that, as you wrote in the comment, the amount of data needed for efficient algorithms to run will eventually prevent you from doing things client-side. Even basic processing requires lots of data, for instance bigram/trigram frequencies, etc. On the other hand, symbolic approaches also need significant data (grammar rules, dictionaries, etc.). From my experience, you can't run a good NLP process without at the very least 3 MB to 5 MB of data, which I think is too big for today's clients.

    So I would do things over the wire. For that I would recommend an asynchronous/push approach, maybe using Faye or Socket.io? I'm sure you can achieve a fluid UX as long as the user isn't blocked while the client waits for the server to process the text.
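On the "over the wire but beachball-free" front, the usual shape is: debounce the keystrokes so only settled text hits the server, and handle the result in a callback so nothing blocks the UI. A minimal debounce sketch; the `socket.emit` usage, the "tokenize" event name, and `render` in the usage comment are illustrative assumptions, not a real API:

```javascript
// Debounce sketch: collapse rapid calls so only the last one fires,
// after `ms` of quiet. Keeps a tokenize-per-keystroke UI from
// hammering the server with a request on every input event.
function debounce(fn, ms) {
  var timer = null;
  return function () {
    var args = arguments;
    clearTimeout(timer);
    timer = setTimeout(function () { fn.apply(null, args); }, ms);
  };
}

// Usage sketch (socket, event name, and render are assumed):
// var requestTokenize = debounce(function (text) {
//   socket.emit("tokenize", text, function (sentences) {
//     render(sentences);          // UI updates in the callback
//   });
// }, 300);
// textarea.addEventListener("input", function (e) {
//   requestTokenize(e.target.value);
// });
```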