python pandas numpy scipy sparse-matrix

Adding Multiple Pandas Columns to Sparse CSR Matrix


My question is based on this question.

I have Twitter data from which I extracted unigram features and counts of orthographic features: exclamation marks, question marks, uppercase letters, and lowercase letters. I want to stack the orthographic features onto the transformed unigram features. Here is my code:

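import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

# Split the text and the orthographic columns together so rows stay aligned.
X_train, X_test, y_train, y_test = train_test_split(
    tweet_df[['tweets', 'exclamation', 'question', 'uppercase', 'lowercase']],
    tweet_df['class'], stratify=tweet_df['class'], test_size=0.2, random_state=0)

# Unigram counts, then TF-IDF weighting.
count_vect = CountVectorizer(ngram_range=(1, 1))
X_train_gram = count_vect.fit_transform(X_train['tweets'])

tfidf = TfidfTransformer()
X_train_gram = tfidf.fit_transform(X_train_gram)

# Append the exclamation counts as one extra (n, 1) column.
X_train_gram = hstack((X_train_gram, np.array(X_train['exclamation'])[:, None]))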

This works. However, I can't find a way to add the remaining columns (question, uppercase, lowercase) to the stack in one line of code. Here are my failed attempts:

X_train_gram = hstack((X_train_gram,np.array(list(X_train['exclamation'], X_train['question'], X_train['uppercase'], X_train['lowercase']))[:,None])) #list expected at most 1 arguments, got 4

X_train_gram = hstack((X_train_gram,np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']])[:,None])) #expected dimension <= 2 array or matrix

X_train_gram = hstack((X_train_gram,np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)[:,None])) #expected dimension <= 2 array or matrix

Any help appreciated.


Solution

  • You have problems with list syntax and sparse.coo_matrix creation.

    np.array(X_train['exclamation'])[:,None]
    

    Converting a Series to an array gives a 1d array; indexing with None adds an axis, so the shape becomes (n,1).

    np.array(list(X_train['exclamation'], X_train['question'], X_train['uppercase'], X_train['lowercase']))[:,None]
    

    That's not a valid call; list() accepts at most one argument (an iterable), not four:

    In [327]: list(1,2,3,4)                                                         
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-327-e06d60ac583e> in <module>
    ----> 1 list(1,2,3,4)
    
    TypeError: list() takes at most 1 argument (4 given)
    

    Next:

    np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']])[:,None]
    

    Selecting multiple columns gives a DataFrame, which converts to a 2d array. Adding the None then produces a 3d array:

    In [328]: np.ones((2,3))[:,None].shape                                          
    Out[328]: (2, 1, 3)
    

    You can't make a coo matrix from a 3d array. Adding .values doesn't change anything, because np.array(dataframe) is the same as dataframe.values:

    np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)[:,None]
    

    Without the [:,None], the array stays 2d, so this has a chance of working:

    hstack((X_train_gram, np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)))
    

    though I'd suggest writing

    from scipy import sparse

    arr = np.array(X_train[['exclamation', 'question', 'uppercase', 'lowercase']].values)
    M = sparse.coo_matrix(arr)
    sparse.hstack((X_train_gram, M))
    

    It's more readable, and should be easier to debug if there are problems.
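
To see the shapes involved without the original tweet data, here is a minimal sketch on toy inputs. The small `X_train_gram` matrix and the `X_train` DataFrame are made-up stand-ins for the TF-IDF output and the orthographic counts in the question. The `cols` name, `to_numpy()` and the `format="csr"` argument are my own choices, not something the question uses.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-in for the TF-IDF matrix: 3 tweets, 4 vocabulary terms.
X_train_gram = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
]))

# Hypothetical orthographic counts for the same 3 tweets.
X_train = pd.DataFrame({
    "exclamation": [1, 0, 2],
    "question": [0, 1, 0],
    "uppercase": [3, 1, 0],
    "lowercase": [20, 15, 30],
})

cols = ["exclamation", "question", "uppercase", "lowercase"]

one_col = X_train["exclamation"].to_numpy()[:, None]  # Series: 1d -> (3, 1)
four_cols = X_train[cols].to_numpy()                  # DataFrame: already 2d, (3, 4)
too_many_dims = four_cols[:, None]                    # (3, 1, 4): the 3d case hstack rejects

# Stack the 2d block directly; no [:, None] is needed for multiple columns.
M = sparse.coo_matrix(four_cols)
stacked = sparse.hstack((X_train_gram, M), format="csr")

print(one_col.shape, four_cols.shape, too_many_dims.shape, stacked.shape)
```

The four count columns end up as the last four columns of `stacked`, in the order given by `cols`. Asking for CSR output is optional; `sparse.hstack` returns a COO matrix by default.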