python numpy scipy sparse-matrix numpy-ndarray

np.asarray() gives me a 1-D array where the data was 2-D (multi-column)

print(X_train_bow.shape) #Output: (897, 2794)
print(type(X_train_bow)) #Output: <class 'scipy.sparse.csr.csr_matrix'>

# Split the rows into 3 interleaved groups (897 / 3 = 299 rows each)
x_train_groups = [X_train_bow[i::3] for i in range(3)]

print(x_train_groups[0].shape) #Output: (299, 2794)
print(type(x_train_groups[0])) #Output: <class 'scipy.sparse.csr.csr_matrix'>

K = 2
train_data = []
test_data = []

for j in range(0, 3):
    if j != K:
        train_data.extend(x_train_groups[j])
test_data.extend(x_train_groups[K])

print(np.asarray(train_data).shape) #Output: (598,)
print(np.asarray(test_data).shape) #Output: (299,)

I'm trying to implement k-fold cross-validation, so I wrote code that merges the groups into train and test data. The problem is that when I call np.asarray on the result, the array it returns has a different shape from the original data: it is 1-D instead of 2-D. The code is above, with the output of each print shown in the comments.

Solution

Let's make a small demo csr matrix:

In [212]: M = (sparse.random(12,3,.5, 'csr')*10).astype(int)                    
In [213]: M                                                                     
Out[213]: 
<12x3 sparse matrix of type '<class 'numpy.int64'>'
    with 18 stored elements in Compressed Sparse Row format>
In [214]: M.A                                                                   
Out[214]: 
array([[3, 1, 3],
       [0, 0, 1],
       [1, 0, 9],
       [0, 6, 0],
       [5, 4, 0],
       [4, 5, 6],
       [3, 0, 0],
       [0, 0, 5],
       [0, 0, 2],
       [0, 1, 0],
       [0, 0, 0],
       [0, 9, 0]])

Your grouping produces a list of small csr matrices:

In [216]: alist = [M[i::3] for i in range(3)]                                   
In [217]: alist                                                                 
Out[217]: 
[<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>,
 <4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>]

Now look at what extend does with one of those groups (your K case):

In [218]: data = []                                                             
In [219]: data.extend(alist[2])                                                 
In [220]: data                                                                  
Out[220]: 
[<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>,
 <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>]

list.extend adds the elements of the iterable to the list one at a time (in a 'flat' sense). Iterating over a sparse matrix (alist[2]) yields one 1-row sparse matrix per row (each still 2-D), so data ends up holding four 1x3 matrices rather than one 4x3 matrix.
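To see the difference between extend and append in isolation, here is a minimal sketch. The 4x3 matrix is a made-up stand-in for one of your groups, built from a dense array so the result is reproducible:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for one group: a small 4x3 csr matrix
M = sparse.csr_matrix(np.arange(12).reshape(4, 3))

rows = []
rows.extend(M)   # iterates over M, adding one 1x3 matrix per row
print(len(rows), rows[0].shape)

whole = []
whole.append(M)  # adds the matrix itself as a single list element
print(len(whole), whole[0].shape)
```

With extend the list has as many entries as the matrix has rows; with append it has one entry that keeps its 2-D shape.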

We can join them using sparse.vstack:

In [221]: sparse.vstack(data)                                                   
Out[221]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [222]: sparse.vstack(data).A                                                 
Out[222]: 
array([[1, 0, 9],
       [4, 5, 6],
       [0, 0, 2],
       [0, 9, 0]])

which is exactly the same as the source submatrix:

In [223]: alist[2]                                                              
Out[223]: 
<4x3 sparse matrix of type '<class 'numpy.int64'>'
    with 7 stored elements in Compressed Sparse Row format>
In [224]: alist[2].A                                                            
Out[224]: 
array([[1, 0, 9],
       [4, 5, 6],
       [0, 0, 2],
       [0, 9, 0]])
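Applied to your train/test split, that suggests skipping the Python lists of rows entirely and stacking the groups directly. This is only a sketch: X is a small made-up matrix standing in for X_train_bow, and n_folds is a name I introduced for the number of groups.

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-in for X_train_bow: 12 rows, 3 columns
X = sparse.csr_matrix(np.arange(36).reshape(12, 3) % 5)

n_folds = 3  # assumed number of groups
K = 2        # index of the group held out for testing

groups = [X[i::n_folds] for i in range(n_folds)]

# Stack every group except K; the result stays sparse and 2-D
train = sparse.vstack([g for j, g in enumerate(groups) if j != K], format="csr")
test = groups[K]

print(train.shape, test.shape)
```

Both train and test stay sparse matrices with the original number of columns, so they can be passed on without ever building a dense array.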

Putting that data list in an array just makes a 1-D object-dtype array of 1-row sparse matrices. To np.array the matrices are just foreign objects. As a general rule, don't count on numpy functions doing the 'right' thing with sparse matrices.

In [225]: np.array(data)                                                        
Out[225]: 
array([<1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 2 stored elements in Compressed Sparse Row format>,
       <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>,
       <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>,
       <1x3 sparse matrix of type '<class 'numpy.int64'>'
    with 1 stored elements in Compressed Sparse Row format>], dtype=object)

Don't just look at shapes. Check the dtype, and examine some elements!
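As a sketch of that kind of check (the small matrix is again a made-up example), compare what np.asarray gives for a list of row matrices with what you get by stacking first and then converting with toarray(), the method form of the .A shorthand used above:

```python
import numpy as np
from scipy import sparse

# Hypothetical example data: a list of four 1x3 sparse rows
M = sparse.csr_matrix(np.arange(12).reshape(4, 3))
pieces = list(M)

arr = np.asarray(pieces)
print(arr.shape, arr.dtype)        # shape alone hides what the elements are
print(type(arr[0]), arr[0].shape)  # each element is still a sparse matrix

# Stack first, then convert, to get a real 2-D numeric array
dense = sparse.vstack(pieces).toarray()
print(dense.shape, dense.dtype)
```

The first array is a 1-D container of objects; only the second is a numeric array with the row and column structure of the original data.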