Tags: hadoop, mapreduce, wikipedia, pagerank

Using the Wikipedia dataset for PageRank in Hadoop


I will be doing a project on PageRank and inverted indexing of the Wikipedia dataset using Apache Hadoop. I downloaded the whole wiki dump from http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2, which decompresses to a single 42 GB .xml file. I want to process this file into input suitable for the PageRank and inverted-indexing MapReduce algorithms. Please help! Any leads will be appreciated.


Solution

  • You need to write your own InputFormat to process the XML. You will also need to implement a RecordReader to ensure that each input split yields fully formed XML chunks (whole <page> elements) rather than single lines. See http://www.undercloud.org/?p=408 for a walkthrough; a sketch follows below.
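
Below is a minimal sketch of such a RecordReader for the newer org.apache.hadoop.mapreduce API, assuming the records of interest are the <page>…</page> elements of the dump. The class name and the hard-coded tags are illustrative, not from any library. The reader starts a record at each <page> found inside its split and keeps reading, even past the split boundary, until the matching </page>, so every mapper receives a complete page:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Illustrative name; emits one complete <page>...</page> chunk per record.
    public class XmlPageRecordReader extends RecordReader<LongWritable, Text> {
        private static final byte[] START_TAG = "<page>".getBytes(StandardCharsets.UTF_8);
        private static final byte[] END_TAG   = "</page>".getBytes(StandardCharsets.UTF_8);

        private long start, end;                 // byte range of this split
        private FSDataInputStream fsin;
        private final DataOutputBuffer buffer = new DataOutputBuffer();
        private final LongWritable key = new LongWritable();
        private final Text value = new Text();

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
            FileSplit fileSplit = (FileSplit) split;
            start = fileSplit.getStart();
            end = start + fileSplit.getLength();
            Path file = fileSplit.getPath();
            fsin = file.getFileSystem(context.getConfiguration()).open(file);
            fsin.seek(start);
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            // Only *start* records inside our split; a record may end past it.
            if (fsin.getPos() < end && readUntilMatch(START_TAG, false)) {
                try {
                    buffer.write(START_TAG);
                    if (readUntilMatch(END_TAG, true)) {   // copy bytes up to </page>
                        key.set(fsin.getPos());
                        value.set(buffer.getData(), 0, buffer.getLength());
                        return true;
                    }
                } finally {
                    buffer.reset();
                }
            }
            return false;
        }

        // Scan byte-by-byte for `match`; if `copy`, append scanned bytes to buffer.
        private boolean readUntilMatch(byte[] match, boolean copy) throws IOException {
            int i = 0;
            while (true) {
                int b = fsin.read();
                if (b == -1) return false;                  // end of file
                if (copy) buffer.write(b);
                if (b == match[i]) {
                    if (++i >= match.length) return true;   // whole tag seen
                } else {
                    i = (b == match[0]) ? 1 : 0;            // restart, re-checking this byte
                }
                // While hunting for a start tag, give up at the split boundary.
                if (!copy && i == 0 && fsin.getPos() >= end) return false;
            }
        }

        @Override public LongWritable getCurrentKey() { return key; }
        @Override public Text getCurrentValue()       { return value; }
        @Override public float getProgress() throws IOException {
            return end == start ? 0f : (fsin.getPos() - start) / (float) (end - start);
        }
        @Override public void close() throws IOException { fsin.close(); }
    }

To use it, wrap it in a small FileInputFormat<LongWritable, Text> subclass whose createRecordReader returns this reader, register that class via job.setInputFormatClass(...), and have your mapper parse each page's <title> and outgoing [[links]] from the value. Apache Mahout ships a comparable ready-made XmlInputFormat (used by its own Wikipedia examples) if you would rather not maintain this yourself. Note that this sketch assumes uncompressed or splittable input; feeding it the raw .bz2 file directly depends on Hadoop's splittable bzip2 codec support.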