print(X_train_bow.shape) #Output: (897, 2794)
print(type(X_train_bow)) #Output: <class 'scipy.sparse.csr.csr_matrix'>
x_train_groups = [X_train_bow[i::5] for i in range(5)]
print(x_train_groups[0].shape) #Output: (299, 2794)
print(type(X_train_bow[0])) #Output: <class 'scipy.sparse.csr.csr_matrix'>
K = 2
train_data = []
test_data = []
for j in range(0, 5):
    if(j != K):
        train_data.extend(x_train_groups[j])
test_data.extend(x_train_groups[K])
print(np.asarray(train_data).shape) #Output: (598,)
print(np.asarray(test_data).shape) #Output: (299,)
I'm trying to implement k-fold cross-validation, so I wrote code that merges the groups into train and test data. The problem is that when I call np.asarray on the result, the array has a different shape from the original data: it is 1d instead of 2d with 2794 columns. The code and its printed output are shown above.
Let's make a small demo csr matrix:
In [212]: M = (sparse.random(12,3,.5, 'csr')*10).astype(int)
In [213]: M
Out[213]:
<12x3 sparse matrix of type '<class 'numpy.int64'>'
with 18 stored elements in Compressed Sparse Row format>
In [214]: M.A
Out[214]:
array([[3, 1, 3],
       [0, 0, 1],
       [1, 0, 9],
       [0, 6, 0],
       [5, 4, 0],
       [4, 5, 6],
       [3, 0, 0],
       [0, 0, 5],
       [0, 0, 2],
       [0, 1, 0],
       [0, 0, 0],
       [0, 9, 0]])
Your grouping produces a list of small csr matrices:
In [216]: alist = [M[i::3] for i in range(3)]
In [217]: alist
Out[217]:
[<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>,
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in Compressed Sparse Row format>,
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>]
Look at the K case:
In [218]: data = []
In [219]: data.extend(alist[2])
In [220]: data
Out[220]:
[<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>,
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>,
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>,
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>]
List extend adds the elements of the iterable to the list (in a 'flat' sense). Iterating over a sparse matrix (alist[2]) yields a bunch of 1-row sparse matrices (still 2d).
We can join them using sparse.vstack:
In [221]: sparse.vstack(data)
Out[221]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [222]: sparse.vstack(data).A
Out[222]:
array([[1, 0, 9],
       [4, 5, 6],
       [0, 0, 2],
       [0, 9, 0]])
which is just the same as the source submatrix:
In [223]: alist[2]
Out[223]:
<4x3 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
In [224]: alist[2].A
Out[224]:
array([[1, 0, 9],
       [4, 5, 6],
       [0, 0, 2],
       [0, 9, 0]])
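For the fold in your question, that suggests stacking the whole group matrices rather than extending a list with their rows. A minimal sketch, assuming a small demo matrix in place of X_train_bow and a hypothetical n_folds of 3 (both are stand-ins, not your real data):

```python
import numpy as np
from scipy import sparse

# Demo stand-in for X_train_bow; the values and shape are made up.
X = sparse.csr_matrix(np.arange(36).reshape(12, 3) % 4)

n_folds = 3   # hypothetical fold count
K = 2         # index of the held-out fold

# Same strided grouping as in the question: each group is a 2d csr matrix.
groups = [X[i::n_folds] for i in range(n_folds)]

# Stack the group matrices themselves; never iterate over their rows.
train_data = sparse.vstack(
    [g for j, g in enumerate(groups) if j != K], format='csr')
test_data = groups[K]

print(train_data.shape)   # (8, 3)
print(test_data.shape)    # (4, 3)
```

Both results stay sparse and 2d, so they can be passed on without going through np.asarray at all.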
Putting that data list in an array just makes a 1d object dtype array of 1-row sparse matrices. The matrices are just foreign objects to np.array. As a general rule, don't count on numpy functions doing the 'right' thing with sparse matrices.
In [225]: np.array(data)
Out[225]:
array([<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 2 stored elements in Compressed Sparse Row format>,
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 3 stored elements in Compressed Sparse Row format>,
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>,
<1x3 sparse matrix of type '<class 'numpy.int64'>'
with 1 stored elements in Compressed Sparse Row format>], dtype=object)
Don't just look at shapes. Check the dtype, and examine some elements!
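A quick way to run those checks, sketched on a made-up 4x3 matrix (the names M, rows and arr are only for illustration):

```python
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.arange(12).reshape(4, 3))

rows = list(M)            # iterating a csr matrix yields 1-row csr matrices
arr = np.asarray(rows)    # numpy just wraps the matrix objects

print(arr.shape)          # one entry per row; the columns are hidden inside
print(arr.dtype)          # object
print(type(arr[0]), arr[0].shape)

# sparse.vstack recovers a proper 2d sparse matrix from the pieces.
stacked = sparse.vstack(rows, format='csr')
print(stacked.shape)
```

An object dtype together with a 1d shape is the sign that numpy stored the matrices as opaque items instead of combining their contents.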