Tags: python, data-structures, machine-learning, spam-prevention

I need a neat data structure suggestion to store a very large dataset (to train Naive Bayes in Python)


I am going to implement a Naive Bayes classifier in Python to classify e-mails as spam or not spam. I have a very large, sparse dataset with many entries, each of which looks like the following:

1 9:3 94:1 109:1 163:1 405:1 406:1 415:2 416:1 435:3 436:3 437:4 ...

Here 1 is the label (spam vs. not spam), and each pair corresponds to a word and its frequency; e.g. 9:3 means that word 9 occurs 3 times in this e-mail sample.

I need to read this dataset and store it in a structure. Since the dataset is very big and sparse, I'm looking for a neat data structure that can store the following:

  • the index of each e-mail
  • its label (1 or -1)
  • each word and its frequency per e-mail
  • I also need to create a corpus of all words and their frequencies, together with the label information

Any suggestions for such a data structure?


Solution

  • I would define a class:

    class Document(object):
    
        def __init__(self, index, label, bowdict):
            self.index = index      # position of the e-mail in the dataset
            self.label = label      # 1 (spam) or -1 (not spam)
            self.bowdict = bowdict  # sparse bag-of-words: {word: count}
    

    You store your sparse vector in bowdict, e.g.

    { 9:3, 94:1, 109:1,  ... } 
    

    and hold all your data in a list of Documents.
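
    For instance, a loader along these lines would build that list from the raw file (a minimal sketch, assuming the whitespace-separated label word:count format shown above; load_documents is just an illustrative name):

    def load_documents(path):
        docs = []
        with open(path) as f:
            for index, line in enumerate(f):
                parts = line.split()
                label = int(parts[0])            # 1 or -1
                bowdict = {}
                for token in parts[1:]:          # tokens look like "9:3"
                    word, count = token.split(':')
                    bowdict[int(word)] = int(count)
                docs.append(Document(index, label, bowdict))
        return docs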

    To aggregate the word counts over all docs with a given label:

    from collections import defaultdict
    
    def aggregate(docs, label):
        # sum up the bag-of-words dicts of all documents with the given label
        bow = defaultdict(int)
        for doc in docs:
            if doc.label == label:
                for word, counter in doc.bowdict.items():
                    bow[word] += counter
        return bow
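
    Building the per-label corpora for Naive Bayes is then just two calls (assuming, as in the question, spam and non-spam are labelled 1 and -1; the filename is made up):

    docs = load_documents('train.txt')
    spam_bow = aggregate(docs, 1)
    ham_bow = aggregate(docs, -1)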
    

    You can persist all your data with the pickle module (cPickle in Python 2).
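
    For example (a minimal sketch; the filename is arbitrary):

    import pickle  # cPickle in Python 2

    # write the list of Document objects to disk ...
    with open('docs.pkl', 'wb') as f:
        pickle.dump(docs, f)

    # ... and load it back later
    with open('docs.pkl', 'rb') as f:
        docs = pickle.load(f)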

    Another approach would be to use scipy.sparse (http://docs.scipy.org/doc/scipy/reference/sparse.html). You can represent a bow vector as a sparse matrix with a single row. To aggregate bows you just add them up, which could be considerably faster than the simple solution above.
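
    A sketch of that idea (to_sparse_row is an illustrative helper, not part of scipy; here the vocabulary size is derived from the data):

    import numpy as np
    from scipy import sparse

    def to_sparse_row(bowdict, vocab_size):
        # build a 1 x vocab_size CSR matrix from a {word: count} dict
        cols = np.array(list(bowdict.keys()), dtype=np.intp)
        data = np.array(list(bowdict.values()), dtype=np.float64)
        rows = np.zeros_like(cols)
        return sparse.csr_matrix((data, (rows, cols)), shape=(1, vocab_size))

    vocab_size = 1 + max(w for doc in docs for w in doc.bowdict)
    spam_rows = [to_sparse_row(doc.bowdict, vocab_size)
                 for doc in docs if doc.label == 1]
    # aggregating the spam bows is a single column-wise sum
    spam_bow = np.asarray(sparse.vstack(spam_rows).sum(axis=0)).ravel()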

    Further, you could store all your sparse docs in one large matrix, where each Document instance holds a reference to the matrix and the row index of its associated row.
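
    One way to set that up (a hypothetical variant of the Document class above, not code from the original answer):

    class MatrixDocument(object):

        def __init__(self, index, label, matrix, row):
            self.index = index
            self.label = label
            self.matrix = matrix  # shared scipy.sparse matrix, one row per document
            self.row = row        # this document's row in the shared matrix

        def bow(self):
            # return this document's bag-of-words as a 1-row sparse vector
            return self.matrix.getrow(self.row)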