I have a list of documents. I use TfidfVectorizer to get dt_matrix, which is a sparse matrix (<class 'scipy.sparse.csr.csr_matrix'>):
from sklearn.feature_extraction.text import TfidfVectorizer
comments = get_comments()
tfidf_vector = TfidfVectorizer(tokenizer=tokenizer, lowercase=False)
dt_matrix = tfidf_vector.fit_transform(comments)
dt_matrix looks something like this:
(0, 642) 0.14738966496831196
(0, 1577) 0.20377626427753473
(0, 1166) 0.2947793299366239
: :
(1046, 166) 0.500700591796996
Now I would like to add the length of each document to this matrix as a feature. I have a length array whose i-th entry is the length of the i-th document.
length=get_comments_length()
length is a numpy array, something like this:
[141 56 79 ... 26 26 26]
I tried to use hstack:
features = np.hstack((dt_matrix, length))
I get this output:
ValueError: Found input variables with inconsistent numbers of samples: [1048, 1047]
I printed the shapes:
print(np.shape(length))
print(np.shape(dt_matrix))
And the output is:
(1047,)
(1047, 2078)
What am I doing wrong?
Edit: this is the working code, using sparse from scipy (thanks to @hpaulij and @kederrak for helping):
sparse.hstack((dt_matrix, length.reshape((length.shape[0], 1))))
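For anyone who wants to try the fix without the original data, here is a minimal sketch. The small dt_matrix and the length values below are made-up stand-ins for the TF-IDF output and for get_comments_length(), and asking for format="csr" is my own choice so that the result keeps the same format as the TF-IDF matrix.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for the TF-IDF output: 6 documents, 10 terms, CSR format.
dense = np.zeros((6, 10))
dense[0, 2] = 0.147
dense[1, 5] = 0.204
dense[3, 7] = 0.295
dense[5, 1] = 0.501
dt_matrix = sparse.csr_matrix(dense)

# Hypothetical stand-in for get_comments_length(): one length per document, shape (6,).
length = np.array([141, 56, 79, 26, 26, 26])

# Reshape the 1-D array into a (6, 1) column so its row count matches dt_matrix.
length_col = length.reshape((length.shape[0], 1))

# sparse.hstack returns COO by default; request CSR to match dt_matrix.
features = sparse.hstack((dt_matrix, length_col), format="csr")

print(features.shape)  # (6, 11)
```

The last column of features then holds the document lengths, and the first ten columns are the unchanged TF-IDF values.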
In [123]: from scipy import sparse
Make a scipy.sparse matrix:
In [124]: M = sparse.random(5,4,.2)
In [125]: M
Out[125]:
<5x4 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in COOrdinate format>
In [126]: print(M)
(0, 3) 0.006222105671732758
(1, 0) 0.7198559134274957
(2, 0) 0.3603986399431639
(4, 2) 0.9519927602284366
In [127]: M.A
Out[127]:
array([[0. , 0. , 0. , 0.00622211],
[0.71985591, 0. , 0. , 0. ],
[0.36039864, 0. , 0. , 0. ],
[0. , 0. , 0. , 0. ],
[0. , 0. , 0.95199276, 0. ]])
In [128]: type(M)
Out[128]: scipy.sparse.coo.coo_matrix
Trying to use np.hstack:
In [129]: np.hstack([M, np.arange(5)[:,None]])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-129-f06fc972039d> in <module>
----> 1 np.hstack([M, np.arange(5)[:,None]])
<__array_function__ internals> in hstack(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in hstack(tup)
341 # As a special case, dimension 0 of 1-dimensional arrays is "horizontal"
342 if arrs and arrs[0].ndim == 1:
--> 343 return _nx.concatenate(arrs, 0)
344 else:
345 return _nx.concatenate(arrs, 1)
<__array_function__ internals> in concatenate(*args, **kwargs)
ValueError: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s)
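The traceback shows that np.hstack does not see M as a 2-D array. A rough way to convince yourself that the problem is the sparse type and not the shapes is to compare against a dense copy. This is only a sketch on a tiny made-up matrix; converting with toarray() is fine here but may use far too much memory for a real TF-IDF matrix.

```python
import numpy as np
from scipy import sparse

# Tiny made-up sparse matrix, 3 rows by 2 columns.
M = sparse.csr_matrix(np.array([[0.0, 0.5],
                                [0.7, 0.0],
                                [0.0, 0.0]]))

# Extra feature as a (3, 1) column.
col = np.arange(3)[:, None]

# np.hstack works once M is an ordinary dense ndarray.
dense_result = np.hstack([M.toarray(), col])

# sparse.hstack accepts the sparse matrix directly and stays sparse.
sparse_result = sparse.hstack([M, col])

# Both routes produce the same values.
same = np.array_equal(dense_result, sparse_result.toarray())
print(same)  # True
```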
Correct use of sparse.hstack:
In [130]: sparse.hstack([M, np.arange(5)[:,None]])
Out[130]:
<5x5 sparse matrix of type '<class 'numpy.float64'>'
with 8 stored elements in COOrdinate format>
In [131]: _.A
Out[131]:
array([[0. , 0. , 0. , 0.00622211, 0. ],
[0.71985591, 0. , 0. , 0. , 1. ],
[0.36039864, 0. , 0. , 0. , 2. ],
[0. , 0. , 0. , 0. , 3. ],
[0. , 0. , 0.95199276, 0. , 4. ]])
If the 2nd array has shape (5,) instead of (5, 1), I get your latest error:
In [132]: sparse.hstack([M, np.arange(5)])
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-132-defd4158f59e> in <module>
----> 1 sparse.hstack([M, np.arange(5)])
/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in hstack(blocks, format, dtype)
463
464 """
--> 465 return bmat([blocks], format=format, dtype=dtype)
466
467
/usr/local/lib/python3.6/dist-packages/scipy/sparse/construct.py in bmat(blocks, format, dtype)
584 exp=brow_lengths[i],
585 got=A.shape[0]))
--> 586 raise ValueError(msg)
587
588 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions. Got blocks[0,1].shape[0] == 1, expected 5.
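So the extra block has to be a column with one row per document before it goes into sparse.hstack. There are several equivalent ways to turn a (5,) array into a (5, 1) column; the sketch below lists the ones I would reach for, using sparse.eye only as a convenient made-up 5x4 matrix.

```python
import numpy as np
from scipy import sparse

length = np.arange(5)            # 1-D array, shape (5,)

# Three equivalent ways to get a (5, 1) column.
col_a = length[:, None]
col_b = length[:, np.newaxis]
col_c = length.reshape(-1, 1)

for col in (col_a, col_b, col_c):
    print(col.shape)             # (5, 1) each time

# Any of them can be stacked next to a sparse matrix with 5 rows.
M = sparse.eye(5, 4, format="csr")
stacked = sparse.hstack([M, col_c])
print(stacked.shape)             # (5, 5)
```

Which spelling to use is a matter of taste; reshape(-1, 1) lets numpy infer the row count, while the indexing forms make the added axis explicit.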