Tags: python, scikit-learn, classification, words, text-classification

Sklearn other inputs in addition to text for text classification


I am trying to build a text classifier with scikit-learn, using bag-of-words vectorization fed into a classifier. However, I was wondering how I would add another variable to the input apart from the text itself. Say I want to add the number of words in the text as a feature in addition to the text (because I think it may affect the result). How should I go about doing so?
Do I have to add another classifier on top of that one? Or is there a way to add that input to the vectorized text?


Solution

  • Scikit-learn classifiers work with NumPy arrays and SciPy sparse matrices. This means that after vectorizing your text, you can append your new features to the resulting matrix as extra columns (I am taking "easily" back: it is not very easy, but it is doable). The problem is that in text categorization your feature matrix will be sparse, so ordinary NumPy column operations such as np.hstack do not work; use scipy.sparse.hstack instead.

    Code modified from the text mining example in the scikit-learn SciPy 2013 tutorial.

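    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    import numpy as np
    import scipy.sparse
    
    # Assumption: `categories` was not defined in the original snippet. These are
    # the four newsgroups used in the tutorial; adjust them to your own data.
    categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
    
    # Load the text data
    twenty_train_subset = load_files('datasets/20news-bydate-train/',
        categories=categories, encoding='latin-1')
    
    # Turn the text documents into sparse TF-IDF vectors
    vectorizer = TfidfVectorizer(min_df=2)
    X_train_only_text_features = vectorizer.fit_transform(twenty_train_subset.data)
    
    print(type(X_train_only_text_features))
    print("X_train_only_text_features", X_train_only_text_features.shape)
    
    # Number of documents (rows)
    size = X_train_only_text_features.shape[0]
    print("size", size)
    
    # A column of ones stands in for the extra feature
    # (e.g. the number of words in each document)
    ones_column = np.ones(size).reshape(size, 1)
    print("ones_column", ones_column.shape)
    
    # Convert the dense column to a sparse matrix so it can be stacked
    new_column = scipy.sparse.csr_matrix(ones_column)
    print(type(new_column))
    print("new_column", new_column.shape)
    
    # Stack the new column next to the text features, column-wise
    X_train = scipy.sparse.hstack([new_column, X_train_only_text_features])
    
    print("X_train", X_train.shape)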
    

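    The output is as follows (the exact class path and the `L` suffixes on the shapes vary with the Python and SciPy versions):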

    <class 'scipy.sparse.csr.csr_matrix'>
    X_train_only_text_features (2034, 17566)
    size 2034
    ones_column (2034L, 1L)
    <class 'scipy.sparse.csr.csr_matrix'>
    new_column (2034, 1)
    X_train (2034, 17567)
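
    The snippet above uses a column of ones as a placeholder. As a minimal sketch of how the word count asked about in the question could take its place, the example below computes the count per document, stacks it next to the TF-IDF features, and fits a classifier. The toy corpus, the labels, and the helper `word_count_column` are hypothetical and only serve the illustration; splitting on whitespace is just one simple way to count words.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and labels, for illustration only.
docs = [
    "the rocket reached orbit",
    "the shuttle docked with the space station after a long flight",
    "the gpu renders each frame",
    "shaders and textures make the rendered image look far more realistic",
]
labels = ["space", "space", "graphics", "graphics"]

# Sparse TF-IDF features, shape (n_docs, n_terms)
vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(docs)


def word_count_column(texts):
    """Return an (n_docs, 1) sparse column with each text's whitespace word count."""
    counts = np.array([len(t.split()) for t in texts], dtype=float)
    return sp.csr_matrix(counts.reshape(-1, 1))


# Put the extra feature in column 0, followed by the text features.
X_train = sp.hstack([word_count_column(docs), X_text], format="csr")

# MultinomialNB needs non-negative features, which word counts are.
clf = MultinomialNB().fit(X_train, labels)

# New documents must go through exactly the same steps:
# transform (not fit_transform) plus the same extra column.
new_docs = ["the probe left orbit"]
X_new = sp.hstack(
    [word_count_column(new_docs), vectorizer.transform(new_docs)],
    format="csr",
)
prediction = clf.predict(X_new)
print(X_train.shape, prediction)
```

    Note that a raw word count is on a much larger scale than TF-IDF values, so depending on the classifier it may be worth rescaling that column (for example, dividing by the maximum count) before stacking. Whether the extra feature actually helps is something to check with cross-validation.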