Tags: python, database, data-analysis

Analysis of data that cannot fit into memory


I have a database of raw text that needs to be analysed. For example, I have collected the title tags of hundreds of millions of individual webpages and clustered them by topic. I am now interested in performing some additional tests on subsets of each topic cluster. The problem is two-fold. First, I cannot fit all of the text into memory to evaluate it. Secondly, I need to run several of these analyses in parallel, so even if I could fit one subset into memory, I certainly could not fit many subsets into memory at once.

I have been working with generators, but often it is necessary to know information about rows of data that have already been loaded and evaluated.
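To illustrate, here is a minimal sketch of the pattern I have been using: a server-side cursor streams rows one at a time while a small running summary of previously evaluated rows stays in memory (this assumes `pymysql` and a hypothetical `titles` table; the summary is just an example):

```python
import pymysql

def stream_titles(connection, batch_size=1000):
    """Yield title rows one at a time without loading the full result set."""
    # SSCursor asks the MySQL server to stream rows instead of buffering them client-side.
    with connection.cursor(pymysql.cursors.SSCursor) as cursor:
        cursor.execute("SELECT id, title FROM titles")
        while True:
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield row

connection = pymysql.connect(host="localhost", user="user",
                             password="secret", database="crawl")

seen_lengths = {}  # small in-memory summary of rows already evaluated
for row_id, title in stream_titles(connection):
    seen_lengths[len(title)] = seen_lengths.get(len(title), 0) + 1
```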

My question is this: what are the best methods for handling and analysing data that cannot fit into memory? The data necessarily must be extracted from some sort of database (currently MySQL, though I will likely be switching to a more powerful solution soon).

I am building the software that handles the data in Python.

Thank you,

EDIT

I will be researching and brainstorming on this all day and plan on continuing to post my thoughts and findings. Please leave any input or advice you might have.

IDEA 1: Tokenize words and n-grams and save the tokens to a file. For each string pulled from the database, tokenize it against the tokens already in that file, creating new tokens as needed. For each word token, combine from right to left until a single reduced token represents all the words in the string. Search an existing list of reduced tokens (small enough to fit in memory) for potential matches and similarities. Each reduced token carries an identifier indicating its token categories. If a reduced token (one created by combining word tokens) matches a tokenized string of interest categorically but not directly, break it back down into its constituent word tokens and compare it word token by word token against the string of interest.

I have no idea whether a library or module already exists that can do this, nor am I sure how much benefit I would gain from it. However, my priorities are: 1) conserving memory, 2) runtime. Thoughts?
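As a rough sketch of the token-dictionary part of this idea (the file name, the hash-based reduction, and the helper names are placeholders of my own, not an existing library):

```python
import json
import os

TOKEN_FILE = "tokens.json"  # hypothetical on-disk word -> id dictionary

def load_tokens(path=TOKEN_FILE):
    """Load the word -> integer-id mapping, or start a fresh one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def tokenize(text, tokens):
    """Map each word to its id, assigning new ids for unseen words."""
    ids = []
    for word in text.lower().split():
        if word not in tokens:
            tokens[word] = len(tokens)
        ids.append(tokens[word])
    return ids

def reduce_tokens(ids):
    """Combine word ids right-to-left into a single hashable representation."""
    reduced = ids[-1]
    for word_id in reversed(ids[:-1]):
        reduced = hash((word_id, reduced))
    return reduced

tokens = load_tokens()
reduced = reduce_tokens(tokenize("cheap flights to new york", tokens))
with open(TOKEN_FILE, "w") as f:
    json.dump(tokens, f)
```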

EDIT 2

Hadoop is definitely going to be the solution to this problem. I found some great resources on natural language processing in Python and Hadoop (see the links below, and a small mapper/reducer sketch after them):

  1. http://www.cloudera.com/blog/2010/03/natural-language-processing-with-hadoop-and-python
  2. http://lintool.github.com/MapReduceAlgorithms/MapReduce-book-final.pdf
  3. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python
  4. https://github.com/klbostee/dumbo/wiki/Short-tutorial
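For reference, this is roughly what a word-count style job looks like with Hadoop Streaming (the approach in tutorial 3 above): Hadoop pipes input lines over stdin and collects tab-separated key/value pairs from stdout. The single-file dispatch on a command-line argument is just my convenience; normally these would be separate mapper.py and reducer.py scripts.

```python
#!/usr/bin/env python
import sys

def mapper():
    # Emit (word, 1) for every word on every input line.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Input arrives sorted by key, so counts for a word are contiguous.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```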

Thanks for your help!


Solution

  • Map/Reduce was created for this purpose.

    The best map reduce engine is Hadoop, but it has a steep learning curve and needs many nodes before it is worth it. If this is a small project, you could use MongoDB, which is a really easy-to-use database and includes an internal map reduce engine that runs JavaScript. The map reduce framework is really simple and easy to learn, but it lacks the tools you would get with Hadoop and the JDK.

    WARNING: You can only run one map reduce job at a time on MongoDB's map reduce engine. This is fine for chained jobs or medium-sized datasets (<100 GB), but it lacks Hadoop's parallelism.
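    For illustration, here is a minimal sketch of what that looks like from Python. It assumes a hypothetical `pages` collection with a `title` field, and an older pymongo (3.x) that still exposes `Collection.map_reduce` (the method was removed in pymongo 4); the map and reduce functions themselves are JavaScript strings:

    ```python
    from pymongo import MongoClient
    from bson.code import Code

    client = MongoClient()   # assumes a local mongod
    db = client.crawl        # hypothetical database name

    # Map: emit one (word, 1) pair per word in each page title.
    mapper = Code("""
        function () {
            this.title.split(' ').forEach(function (word) {
                emit(word, 1);
            });
        }
    """)

    # Reduce: sum the counts emitted for each word.
    reducer = Code("""
        function (key, values) {
            return Array.sum(values);
        }
    """)

    # Results are written to the 'title_word_counts' collection.
    result = db.pages.map_reduce(mapper, reducer, "title_word_counts")
    for doc in result.find().sort("value", -1).limit(10):
        print(doc["_id"], doc["value"])
    ```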