Tags: python, data-structures, machine-learning, spam-prevention

I need a neat data structure suggestion to store a very large dataset (to train Naive Bayes in Python)


I am going to implement a Naive Bayes classifier in Python to classify e-mails as spam or not spam. I have a very large, sparse dataset with many entries, each of which looks like the following:

1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 ...

Here 1 is the label (spam vs. not spam), and each pair corresponds to a word and its frequency; e.g. 9:3 means that word 9 occurs 3 times in this e-mail sample.

I need to read this dataset and store it in a structure. Since the dataset is very big and sparse, I'm looking for a neat data structure that can store the following:

  • the index of each e-mail
  • its label (1 or -1)
  • each word and its frequency per e-mail
  • I also need to create a corpus of all words and their frequencies, together with the label information

Any suggestions for such a data structure?


Solution

  • I would define a class:

    class Document(object):
    
        def __init__(self, index, label, bowdict):
            self.index = index      # position of the e-mail in the dataset
            self.label = label      # 1 (spam) or -1 (not spam)
            self.bowdict = bowdict  # sparse bag-of-words: {word: count}
    

    You store your sparse vector in bowdict, e.g.

    { 9:3, 94:1, 109:1,  ... } 
    

    and hold all your data in a list of Documents.
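
    For instance, a loader along these lines would build that list from the raw file (a minimal sketch, assuming the whitespace-separated label word:count format shown above; load_documents is just an illustrative name):

    def load_documents(path):
        docs = []
        with open(path) as f:
            for index, line in enumerate(f):
                parts = line.split()
                label = int(parts[0])            # 1 or -1
                bowdict = {}
                for token in parts[1:]:          # tokens look like "9:3"
                    word, count = token.split(':')
                    bowdict[int(word)] = int(count)
                docs.append(Document(index, label, bowdict))
        return docs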

    To aggregate the word counts over all docs with a given label:

    from collections import defaultdict
    
    def aggregate(docs, label):
        # sum up the bag-of-words dicts of all documents with the given label
        bow = defaultdict(int)
        for doc in docs:
            if doc.label == label:
                for word, counter in doc.bowdict.items():
                    bow[word] += counter
        return bow
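
    Building the per-label corpora for Naive Bayes is then just two calls (assuming, as in the question, spam and non-spam are labelled 1 and -1; the filename is made up):

    docs = load_documents('train.txt')
    spam_bow = aggregate(docs, 1)
    ham_bow = aggregate(docs, -1)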
    

    You can persist all your data with the pickle module (cPickle in Python 2).
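
    For example (a minimal sketch; the filename is arbitrary):

    import pickle  # cPickle in Python 2

    # write the list of Document objects to disk ...
    with open('docs.pkl', 'wb') as f:
        pickle.dump(docs, f)

    # ... and load it back later
    with open('docs.pkl', 'rb') as f:
        docs = pickle.load(f)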

    Another approach would be to use scipy.sparse (http://docs.scipy.org/doc/scipy/reference/sparse.html). You can represent a bow vector as a sparse matrix with a single row. To aggregate bows you just add them up, which could be considerably faster than the simple solution above.
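
    A sketch of that idea (to_sparse_row is an illustrative helper, not part of scipy; here the vocabulary size is derived from the data):

    import numpy as np
    from scipy import sparse

    def to_sparse_row(bowdict, vocab_size):
        # build a 1 x vocab_size CSR matrix from a {word: count} dict
        cols = np.array(list(bowdict.keys()), dtype=np.intp)
        data = np.array(list(bowdict.values()), dtype=np.float64)
        rows = np.zeros_like(cols)
        return sparse.csr_matrix((data, (rows, cols)), shape=(1, vocab_size))

    vocab_size = 1 + max(w for doc in docs for w in doc.bowdict)
    spam_rows = [to_sparse_row(doc.bowdict, vocab_size)
                 for doc in docs if doc.label == 1]
    # aggregating the spam bows is a single column-wise sum
    spam_bow = np.asarray(sparse.vstack(spam_rows).sum(axis=0)).ravel()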

    Further, you could store all your sparse docs in one large matrix, where each Document instance holds a reference to the matrix and the row index of its associated row.
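
    One way to set that up (a hypothetical variant of the Document class above, not code from the original answer):

    class MatrixDocument(object):

        def __init__(self, index, label, matrix, row):
            self.index = index
            self.label = label
            self.matrix = matrix  # shared scipy.sparse matrix, one row per document
            self.row = row        # this document's row in the shared matrix

        def bow(self):
            # return this document's bag-of-words as a 1-row sparse vector
            return self.matrix.getrow(self.row)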