Search code examples
javahadoopontologytext-miningwikimedia-dumps

How to improve the performance when working with wikipedia data and huge no. of webpages?


I am supposed to extract representative terms from an organisation's website using wikipedia's article-link data dump. To achieve this I've -

  1. Crawled & downloaded organisation's webpages. (~110,000)
  2. Created a dictionary of wikipedia ID and terms/title. (~40million records)

Now, I'm supposed to process each of the webpages using the dictionary to recognise terms and track their term IDs & frequencies.

For the dictionary to fit in memory, I've splitted the dictionary into smaller files. Based on my experiment with a small data-set, the processing time for the above will be around 75 days.

And this is just for 1 organisation. I have to do the same for more than 40 of them.

Implementation -

  • HashMap for storing dictionary in memory.
  • looping through each map entry to search the term in a webpage, using Boyer-Moore search implementation.
  • Repeating the above for each webpage, and storing results in a HashMap.

I've tried optimizing the code and tuning the JVM for better performance.

Can someone please advise on a more efficient way to implement the above, reducing the processing time to a few days.

Is Hadoop an option to consider?


Solution

  • Based on your question:

    Number of Documents = 110000
    
    Dictionary => List of [TermID, Title Terms] = 40million entries
    
    Size of documents = 11000 * 1KB per document on an average =  26.9GB
    (1KB per document on an average)
    
    Size of dictionary = 40million * 256bytes = 9.5GB of raw data
    (256bytes per entry on an average)
    

    How did you arrive at the 75 days estimate?

    There are number of performance targets:

    • How are you storing the Documents?
    • How are you storing/retrieving the Dictionary? ( assuming not all of it in memory unless you can afford to)
    • How many machines are you running it on?
    • Are you performing the dictionary lookups in parallel? ( of-course assuming dictionary is immutable once you have already processed whole of wikipedia )

    Here is an outline of what I believe you are doing:

    dictionary = read wikipedia dictionary
    document = a sequence of documents
    documents.map { doc =>
      var docTermFreq = Map[String, Int]()
      for(term <- doc.terms.map if(dictionary.contains(term)) ) {
         docTermFreq = docTermFreq + (term -> docTermFreq.getOrElse(term, 0) + 1)
      }
      // store docTermFreq map
    }
    

    What this is essentially doing is breaking up each document into tokens and then performing a lookup in wikipedia dictionary for its token's existence.

    This is exactly what a Lucene Analyzer does.

    A Lucene Tokenizer will convert document into tokens. This happens before the terms are indexed into lucene. So all you have to do is implement a Analyzer which can lookup the Wikipedia Dictionary, for whether or not a token is in dictionary.

    I would do it like this:

    • Take every document and prepare a token stream ( using an Analyzer described above )
    • Index the document terms.
    • At this point you will have wikipedia terms only, in the Lucene Index.

    When you do this, you will have ready-made statistics from the Lucene Index such as:

    There are lot of things you can do to improve the performance. For example:

    • Parallelize the document stream processing.
    • You can store the dictionary in key-value database such as BerkeylyDB or Kyoto Cabinet, or even an in-memory key-value storage such as Redis or Memcache.

    I hope that helps.