Use some datatype as input for a MapReduce job.

I am working on a set of MapReduce jobs that convert a list of plot summaries into an index of every word mapping to the movieID and how many times it was used. I have one job that takes the input and creates a linked list of Nodes with the word, the movie it came from, and the number of times. My second job takes this LinkedList and uses the word as the key and the movieID and number of occurrences as the value and spits out an index of every word mapped to every movie it was used in as well as the number of occurrences.

When calling FileInputFormat.addInputPath() I can either use a Path() or a String with each element separated with a comma. It wouldn't be hard to just have a massive string with all the data my LinkedList holds but it would be nicer to just have the mapper use the LinkedList as input.

I have read about chaining MapReduce jobs so please don't give me a link to the Yahoo Developer page.

Solution

You don't need two MapReduce jobs (or therefore a LinkedList) here. We can instead treat this as a variant of word count, but with a list of movie IDs added into the output.

Map input:

354757\tToys come alive
432984\tMore toys

Map code:

String[] idAndWords = input.split("\\t");

for(String word : idAndWords[1].split(" ")) {
    //Do whatever cleansing you want here - remove punctuation, lowercase etc.
    output.write(word,idAndWords[0]);
}

Map output:

("toys","354757")
("come","354757")
("alive","354757")
("more","432984")
("toys","432984")

Reducer code:

//Maps movie IDs to occurrences
Map<String,Int> movieMap = new HashMap<>();
//In here is the list of movie IDs for each word
for(String val : values) {
    if(!movieMap.contains(val)) {
        movieMap.put(val,1);
    } else {
        movieMap.put(val,movieMap.get(val)+1);
    }
}
output.write(NullWritable.get(),key+"\t"+movieMap);

Reducer output:

toys\t[(3547571),(432984,1)]
come\t[(354757,1)]
alive\t[(354757,1)]
more\t[(432984,1)]

Now you have no need for a custom Writable, and it's fewer than a dozen lines of logic as opposed to what I think would be a pretty complex set of two chained MR jobs.

Efficiency extension:

You could make it more efficient by adding a count into the mapper output - with the current implementation then the plotline "dog eat dog" would result in the map output:

("dog","354757")
("eat","354757")
("dog","354757")

whereas you could reduce it to two records by adding a counter and scanning the whole line before outputting:

("dog","354757\t2")
("eat","354757\t1")

I didn't want to make my example more complex and less-readable by including this, but it should be trivial to implement and should give good performance savings.