
Ruby Text Analysis


Is there a Ruby gem, or any other tool, for text analysis? I am looking for word frequency, pattern detection, and so forth (preferably with an understanding of French).


Solution

  • Language models generalize word frequencies: uni-grams (single-word frequencies), bi-grams (frequencies of word pairs), tri-grams (frequencies of word triples), and in general n-grams.

    You should look for an existing language-model toolkit; there is no point in reinventing the wheel here.

    There are a few standard toolkits available, e.g. from the CMU Sphinx team, and also HTK.

    These toolkits are typically written in C for speed, because they have to process huge corpora. They write their output as ARPA n-gram files, a standard format that is typically plain text.

    Check the following thread, which contains more details and links:

    Building openears compatible language model

    Once you have generated your language model with one of these toolkits, you will need either a Ruby gem that makes the language model accessible from Ruby, or you will need to convert the ARPA file into your own format.

    adi92's post lists some more Ruby NLP resources.

    You can also search for "ARPA Language Model" for more information.

    Last but not least, check Google's online N-gram tool. Google built n-grams from the books it digitized, and they are also available in French and other languages.
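
For small inputs, the uni-gram and bi-gram counts described at the start of this answer can be computed in plain Ruby without any toolkit. The following is only a minimal sketch, assuming Ruby 2.7 or later (for `tally`). The helper names `tokenize` and `ngram_counts` and the sample sentence are made up for illustration. It also does no smoothing and ignores sentence boundaries, both of which a real language-model toolkit would handle.

```ruby
# Split text into lowercase word tokens. \p{L} matches any Unicode letter,
# so accented French characters stay inside their words.
def tokenize(text)
  text.downcase.scan(/\p{L}+/)
end

# Count every run of n consecutive tokens. Keys are arrays of words.
# Note: runs are allowed to cross sentence boundaries in this sketch.
def ngram_counts(tokens, n)
  tokens.each_cons(n).tally
end

text = "Le chat dort. Le chat mange. Le chien dort à côté."
tokens = tokenize(text)

unigrams = ngram_counts(tokens, 1)
bigrams  = ngram_counts(tokens, 2)

# Print the three most frequent bigrams.
bigrams.sort_by { |_gram, count| -count }.first(3).each do |gram, count|
  puts "#{gram.join(' ')}: #{count}"
end
```

This is fine for exploring a few documents. For large corpora, the C toolkits mentioned above remain the more practical choice.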
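
If no suitable gem turns up, converting an ARPA file into a Ruby structure is not much work, because the format is line-oriented text. The sketch below assumes the common layout: `\N-grams:` section headers, followed by lines holding a log10 probability, the N words, and an optional back-off weight. `parse_arpa` is a hypothetical helper name and the sample model values are invented. Files produced by a real toolkit may contain details this sketch does not handle.

```ruby
# Parse ARPA-format text into { order => { [words] => { logprob:, backoff: } } }.
# `source` can be a String or an open File; both respond to each_line.
def parse_arpa(source)
  model = Hash.new { |hash, order| hash[order] = {} }
  order = nil

  source.each_line do |line|
    line = line.strip
    next if line.empty?

    if (match = line.match(/\A\\(\d+)-grams:\z/))
      order = match[1].to_i        # start of a new n-gram section
    elsif line.start_with?("\\")
      order = nil                  # \data\ or \end\: not inside a section
    elsif order
      fields = line.split
      model[order][fields[1, order]] = {
        logprob: fields[0].to_f,             # log10 probability
        backoff: fields[order + 1]&.to_f     # nil when no back-off weight
      }
    end
  end

  model
end

# Invented sample data, only to show the shape of the format.
sample = <<~'ARPA'
  \data\
  ngram 1=3
  ngram 2=2

  \1-grams:
  -0.5 le -0.3
  -1.0 chat -0.2
  -1.2 dort

  \2-grams:
  -0.3 le chat
  -0.6 chat dort

  \end\
ARPA

model = parse_arpa(sample)
entry = model[2][["le", "chat"]]
puts "P(chat | le) is about #{(10**entry[:logprob]).round(3)}"
```

Keeping the result as nested hashes keyed by word arrays makes lookups straightforward, and the structure can then be serialized to whatever format the application needs.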