Python: how to turn list of word counts into format suitable for CountVectorizer


I have ~100,000 lists of strings of the form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'] etc.
which essentially makes up my corpus. Each list contains the words from a document and their word counts.

How can I put this corpus into a form that I can feed into CountVectorizer?

Is there a quicker way than turning each list into a string containing 'the' 652 times, 'of' 216 times, etc.?


Solution

  • Assuming that what you're trying to achieve is a vectorized corpus in sparse matrix format, along with a trained vectorizer, you can simulate the vectorization process without repeating the data:

    from scipy.sparse import lil_matrix
    from sklearn.feature_extraction.text import CountVectorizer
    
    corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
              ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]
    
    
    # Prepare a vocabulary for the vectorizer
    vocabulary = {item.split(':')[0] for document in corpus for item in document}
    indexed_vocabulary = {term: index for index, term in enumerate(vocabulary)}
    vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)
    
    # Vectorize the corpus using the coordinates known to the vectorizer.
    # Item assignment keeps each LIL row's column indices sorted.
    X = lil_matrix((len(corpus), len(vocabulary)))
    for row, document in enumerate(corpus):
        for item in document:
            term, count = item.split(':')
            X[row, vectorizer.vocabulary[term]] = int(count)
    
    # Convert the matrix to csr format to be compatible with vectorizer.transform output
    X = X.tocsr()
    

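    In this example, X.toarray() gives the following (the column order follows indexed_vocabulary, which is built from a set, so it can differ between runs):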

    [[ 168.  216.    0.  159.  652.  145.    0.]
     [   0.   16.  110.    0.  400.    0.   20.]]
    

    The same vectorizer can then transform further documents:

    vectorizer.transform(['jungle kid is programming', 'the jungle machine learning jungle'])
    

    Which yields (words outside the vocabulary, such as 'kid', are ignored):

    [[0 0 1 0 0 1 0]
     [0 0 2 0 1 0 0]]
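
As a possible variation on the approach above, the matrix can be built directly in CSR format from (row, column, count) triplets, which skips the intermediate LIL matrix. This is only a sketch under a few assumptions: `parse_item` is a hypothetical helper, not part of scikit-learn or SciPy; every item is assumed to have the form 'word: count'; and the vocabulary is sorted so the column order is reproducible, which means the columns come out in a different order from the listing above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import CountVectorizer

corpus = [['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'],
          ['king: 20', 'of: 16', 'the: 400', 'jungle: 110']]


def parse_item(item):
    """Hypothetical helper: split 'word: count' into (word, count).

    rsplit on the last ':' tolerates a colon inside the word itself.
    """
    word, count = item.rsplit(':', 1)
    return word.strip(), int(count)


parsed = [[parse_item(item) for item in document] for document in corpus]

# Sort the terms so each term always maps to the same column
terms = sorted({word for document in parsed for word, _ in document})
indexed_vocabulary = {term: index for index, term in enumerate(terms)}

# Collect one (row, column, count) triplet per word in each document
rows, cols, counts = [], [], []
for row, document in enumerate(parsed):
    for word, count in document:
        rows.append(row)
        cols.append(indexed_vocabulary[word])
        counts.append(count)

# Build the sparse matrix in one step from the triplets
X = csr_matrix((counts, (rows, cols)),
               shape=(len(parsed), len(terms)), dtype=np.int64)

# A vectorizer with the same vocabulary maps new text to the same columns
vectorizer = CountVectorizer(vocabulary=indexed_vocabulary)
X_new = vectorizer.transform(['jungle kid is programming'])

print(terms)
print(X.toarray())
print(X_new.toarray())
```

With the sorted vocabulary, the columns correspond to in, is, jungle, king, of, the, to. Building from triplets avoids per-element assignment, which may matter with ~100,000 documents, though it has not been benchmarked here.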