Tags: python, numpy, scikit-learn, tfidfvectorizer

How to properly use numpy hstack


I have a list of documents. I use TfidfVectorizer to get dt_matrix, which is a sparse matrix (<class 'scipy.sparse.csr.csr_matrix'>):

comments = get_comments()
tfidf_vector = TfidfVectorizer(tokenizer=tokenizer, lowercase=False)
dt_matrix = tfidf_vector.fit_transform(comments)

dt_matrix looks something like this:

  (0, 642)      0.14738966496831196
  (0, 1577)     0.20377626427753473
  (0, 1166)     0.2947793299366239
  :     :
  (1046, 166)   0.500700591796996

Now I would like to add the length of each document to this matrix as a feature. I have a length array whose i-th position holds the length of the i-th document.

length = get_comments_length()

length is a numpy array, something like this:

[141  56  79 ...  26  26  26]

I tried to combine them with np.hstack:

features = np.hstack((dt_matrix, length))

I get this error:

ValueError: Found input variables with inconsistent numbers of samples: [1048, 1047]

I printed the shapes:

print(np.shape(length))
print(np.shape(dt_matrix))

And the output is:

(1047,)
(1047, 2078)

What am I doing wrong?

Edit:

The working code is sparse.hstack((dt_matrix, length.reshape((length.shape[0], 1)))), using sparse from scipy. Thanks to @hpaulij and @kederrak for helping.
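
For reference, here is a minimal sketch of that fix. It assumes a small hand-made CSR matrix as a stand-in for the TfidfVectorizer output and a made-up length array in place of get_comments_length(). Passing format="csr" is my own choice, since sparse.hstack returns COO format by default.

```python
import numpy as np
from scipy import sparse

# Stand-in for the TF-IDF output (assumption): one row per document, CSR format.
dt_matrix = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.7, 0.0, 0.0],
    [0.0, 0.0, 0.9],
    [0.2, 0.0, 0.4],
]))

# Stand-in for get_comments_length() (assumption): 1-D array, one length per document.
length = np.array([141, 56, 79, 26])

# Reshape (n,) -> (n, 1) so the lengths become a single column.
length_col = length.reshape(-1, 1)

# Stack the sparse matrix and the column side by side, keeping the result sparse.
features = sparse.hstack((dt_matrix, length_col), format="csr")

print(features.shape)  # one extra column compared to dt_matrix
```

The result stays sparse, so it can be handed to an estimator without densifying the TF-IDF part.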


Solution

    In [123]: from scipy import sparse
    

    Make a scipy.sparse matrix:

    In [124]: M = sparse.random(5,4,.2)                                                            
    In [125]: M                                                                                    
    Out[125]: 
    <5x4 sparse matrix of type '<class 'numpy.float64'>'
        with 4 stored elements in COOrdinate format>
    In [126]: print(M)                                                                             
      (0, 3)    0.006222105671732758
      (1, 0)    0.7198559134274957
      (2, 0)    0.3603986399431639
      (4, 2)    0.9519927602284366
    In [127]: M.A                                                                                  
    Out[127]: 
    array([[0.        , 0.        , 0.        , 0.00622211],
           [0.71985591, 0.        , 0.        , 0.        ],
           [0.36039864, 0.        , 0.        , 0.        ],
           [0.        , 0.        , 0.        , 0.        ],
           [0.        , 0.        , 0.95199276, 0.        ]])
    In [128]: type(M)                                                                              
    Out[128]: scipy.sparse.coo.coo_matrix
    

    Trying np.hstack fails:

    In [129]: np.hstack([M, np.arange(5)[:,None]])                                                 
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-129-f06fc972039d> in <module>
    ----> 1 np.hstack([M, np.arange(5)[:,None]])
    
    <__array_function__ internals> in hstack(*args, **kwargs)
    
    /usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in hstack(tup)
        341     # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
        342     if arrs and arrs[0].ndim == 1:
    --> 343         return _nx.concatenate(arrs, 0)
        344     else:
        345         return _nx.concatenate(arrs, 1)
    
    <__array_function__ internals> in concatenate(*args, **kwargs)
    
    ValueError: all the input arrays must have same number of dimensions, 
    but the array at index 0 has 1 dimension(s) and the array at index 1
    has 2 dimension(s)
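
A likely reason for this error, as far as I can tell, is that NumPy does not know how to read a scipy.sparse matrix and wraps the whole thing in a single object element. The sketch below is only an illustration of that behavior; the identity matrix is an arbitrary stand-in.

```python
import numpy as np
from scipy import sparse

# Arbitrary small sparse matrix used only for illustration.
M = sparse.csr_matrix(np.eye(5))

# NumPy sees the sparse matrix as one opaque Python object,
# so it becomes a 1-element object array rather than a 5x5 array.
wrapped = np.atleast_1d(M)
print(wrapped.shape, wrapped.dtype)

# Stacking it with a 1-D array of 5 numbers just concatenates
# that single object with the 5 values.
stacked = np.hstack((M, np.arange(5)))
print(stacked.shape, stacked.dtype)
```

This would also explain the 1048 in the question's error: 1047 lengths plus one element holding the entire matrix.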
    

    Correct use of sparse.hstack:

    In [130]: sparse.hstack([M, np.arange(5)[:,None]])                                             
    Out[130]: 
    <5x5 sparse matrix of type '<class 'numpy.float64'>'
        with 8 stored elements in COOrdinate format>
    In [131]: _.A                                                                                  
    Out[131]: 
    array([[0.        , 0.        , 0.        , 0.00622211, 0.        ],
           [0.71985591, 0.        , 0.        , 0.        , 1.        ],
           [0.36039864, 0.        , 0.        , 0.        , 2.        ],
           [0.        , 0.        , 0.        , 0.        , 3.        ],
           [0.        , 0.        , 0.95199276, 0.        , 4.        ]])
    

    If the second array has shape (5,) instead of (5,1), I get your latest error:

    In [132]: sparse.hstack([M, np.arange(5)])                                                     
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-132-defd4158f59e> in <module>
    ----> 1 sparse.hstack([M, np.arange(5)])
    
    /usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
        463 
        464     """
    --> 465     return bmat([blocks], format=format, dtype=dtype)
        466 
        467 
    
    /usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
        584                                                     exp=brow_lengths[i],
        585                                                     got=A.shape[0]))
    --> 586                     raise ValueError(msg)
        587 
        588                 if bcol_lengths[j] == 0:
    
    ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 5.
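
To summarize the shape requirement, here is a short sketch. The matrix and the extra array are arbitrary stand-ins. It shows the 1-D array being rejected and three equivalent ways of turning it into the (5, 1) column that sparse.hstack accepts.

```python
import numpy as np
from scipy import sparse

# Arbitrary stand-ins: a 5x5 sparse matrix and a 1-D array with 5 values.
M = sparse.csr_matrix(np.eye(5))
extra = np.arange(5)  # shape (5,)

# A 1-D array is read as a single row, so its row count does not match M.
try:
    sparse.hstack([M, extra])
    rejected = False
except ValueError as err:
    rejected = True
    print("1-D block rejected:", err)

# Equivalent ways to build a column of shape (5, 1).
columns = [
    extra[:, None],
    extra[:, np.newaxis],
    extra.reshape(-1, 1),
]

# Each column stacks cleanly and adds exactly one feature column.
results = [sparse.hstack([M, col]) for col in columns]
for out in results:
    print(out.shape)
```

Which spelling to use is a matter of taste; all three produce the same column.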