
How to create sequence files from tsv file for text classification


I have a TSV file whose columns are class, id and text, e.g.

positive    2342    This is very good.
negative    4343    I hate it.

and I'm trying to feed it to Mahout's nbayes to classify the text part as either pos or neg.

My first attempt was to run Mahout's seqdirectory command with every line stored as a separate file in its class directory. This works well with a small amount of data but eventually fails at around 30 gigabytes of data with an OutOfMemoryError. Increasing the heap size fails with "GC overhead limit exceeded", probably because of the large number of separate files.

My second attempt was to load the data into a Hive table and convert it to a sequence file, as described here [0]. This seems to work fine at first, but after creating the vector file and splitting up the data set, the trainnb step fails with an ArrayIndexOutOfBoundsException.

[0] http://files.meetup.com/6195792/Working%20With%20Mahout.pdf

Right now I'm out of ideas about what to look for. Any ideas how I can convert the TSV file or Hive table to a sequence file like the one the seqdirectory command generates from a directory?


Solution

  • Going to answer this myself in case someone else needs a solution to the same or a similar problem:

    I found this code snippet on GitHub and modified it to my needs. Additionally, I had to trim the value string to get proper results.
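
    The snippet itself is not reproduced here, so the following is only a rough sketch of the record-building step, not the original code. It assumes the key should look like "/<class>/<id>", mimicking a file named after the id inside its class directory as seqdirectory would see it; that layout, the class name TsvToSeqRecords and its method names are assumptions made for illustration. It deliberately avoids Hadoop dependencies and only shows how each TSV line maps to a key/value pair, with the value trimmed as mentioned above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the class and method names are illustrative only.
final class TsvToSeqRecords {

    // Turns one "class<TAB>id<TAB>text" line into a (key, value) pair.
    // Assumed key layout: "/<class>/<id>" (a file <id> in a <class> directory).
    static Map.Entry<String, String> toRecord(String line) {
        // A limit of 3 keeps any further tabs inside the text column.
        String[] fields = line.split("\t", 3);
        if (fields.length < 3) {
            throw new IllegalArgumentException(
                "expected 3 tab-separated fields: " + line);
        }
        String label = fields[0].trim();
        String id = fields[1].trim();
        String text = fields[2].trim(); // trim the value string
        return new SimpleEntry<>("/" + label + "/" + id, text);
    }

    // Converts all non-blank lines; blank lines are skipped.
    static List<Map.Entry<String, String>> toRecords(List<String> lines) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            records.add(toRecord(line));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "positive\t2342\tThis is very good.",
            "negative\t4343\tI hate it.");
        for (Map.Entry<String, String> r : toRecords(lines)) {
            System.out.println(r.getKey() + " => " + r.getValue());
        }
    }
}
```

    In a real job, each pair would presumably be wrapped in Hadoop Text objects and appended to a SequenceFile.Writer instead of being collected in a list. Reading the TSV line by line this way avoids creating one small file per document, which is what made the seqdirectory approach run out of memory.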