Tags: postgresql, hadoop, mapreduce, bigdata, elastic-map-reduce

Producing ngram frequencies for a large dataset


I'd like to generate ngram frequencies for a large dataset. Wikipedia, or more specifically Freebase's WEX, is suitable for my purposes.

What's the best and most cost efficient way to do it in the next day or so?

My thoughts are:

  • PostgreSQL, using regex to split sentences and words. I already have the WEX dump in PostgreSQL, and I already have regex to do the splitting (high accuracy isn't required here; a rough sketch of the kind of splitting I mean follows this list)
  • MapReduce with Hadoop
  • MapReduce with Amazon's Elastic MapReduce, which I know next to nothing about
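
For what it's worth, the splitting I have in mind is roughly the following, sketched in Java with placeholder regexes rather than my actual PostgreSQL patterns:

```java
import java.util.ArrayList;
import java.util.List;

public class RoughSplitter {
    // Placeholder patterns, not my real ones: split sentences on ./!/? followed
    // by whitespace, and words on any run of non-letter characters. Deliberately crude.
    private static final String SENTENCE_BOUNDARY = "(?<=[.!?])\\s+";
    private static final String NON_WORD = "[^\\p{L}]+";

    // Returns a list of sentences, each a list of lowercase words,
    // ready for ngram extraction that doesn't cross sentence boundaries.
    public static List<List<String>> split(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String sentence : text.split(SENTENCE_BOUNDARY)) {
            List<String> words = new ArrayList<>();
            for (String w : sentence.toLowerCase().split(NON_WORD)) {
                if (!w.isEmpty()) words.add(w);
            }
            if (!words.isEmpty()) sentences.add(words);
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Prints [[the, quick, brown, fox], [it, jumped, over, the, lazy, dog]]
        System.out.println(split("The quick brown fox. It jumped over the lazy dog!"));
    }
}
```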

My experience with Hadoop consists of calculating Pi on three EC2 instances, very inefficiently. I'm good with Java, and I understand the concept of Map + Reduce. I fear PostgreSQL will take a long, long time, as the work isn't easily parallelisable.

Any other ways to do it? What's my best bet for getting it done in the next couple of days?


Solution

  • MapReduce will work just fine, and you could probably do most of the input-output shuffling with Pig.

    See http://arxiv.org/abs/1207.4371 for some algorithms.

    Of course, to get a running start you don't actually need MapReduce for this task: just split the input yourself, write the simplest fast program that calculates ngrams for a single input file, and aggregate the ngram frequencies afterwards. A rough sketch of the counting logic follows.
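
To make that concrete, here is a minimal sketch of what the Hadoop job could look like (my own sketch, not taken from the paper above): the mapper lowercases each line, splits it on a crude non-letter regex, and emits every ngram with a count of 1; the reducer, reused as a combiner, sums the counts. The class names, the tokenizing regex, the choice of n = 3, and the per-line treatment of ngrams are all assumptions to adjust.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NgramCount {

    public static class NgramMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int N = 3;                       // assumed: trigrams
        private static final IntWritable ONE = new IntWritable(1);
        private final Text ngram = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Crude tokenization: lowercase, keep runs of ASCII letters only.
            List<String> words = new ArrayList<>();
            for (String w : line.toString().toLowerCase().split("[^a-z]+")) {
                if (!w.isEmpty()) words.add(w);
            }
            // Emit every ngram in the line with a count of 1.
            // Note: ngrams spanning line boundaries are ignored here.
            for (int i = 0; i + N <= words.size(); i++) {
                ngram.set(String.join(" ", words.subList(i, i + N)));
                context.write(ngram, ONE);
            }
        }
    }

    // Sums counts; safe to reuse as a combiner because addition is associative.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text ngram, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(ngram, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "ngram count");
        job.setJarByClass(NgramCount.class);
        job.setMapperClass(NgramMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // text input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

If you skip MapReduce entirely, the mapper's tokenize-and-emit loop is essentially the whole "simplest fast program": run it over each split of the dump, accumulate counts in a HashMap per split, and merge the per-split count files at the end.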