Tags: python, numpy, scipy, sparse-matrix, numpy-ndarray

np.asarray() gives me a 1-D array where the data was multi-column


print(X_train_bow.shape) #Output: (897, 2794)
print(type(X_train_bow)) #Output: <class 'scipy.sparse.csr.csr_matrix'>

x_train_groups = [X_train_bow[i::5] for i in range(5)]

print(x_train_groups[0].shape) #Output: (299, 2794)
print(type(X_train_bow[0])) #Output: <class 'scipy.sparse.csr.csr_matrix'>

K = 2
train_data = []
test_data = []

for j in range(0, 5):
    if(j != K):
        train_data.extend(x_train_groups[j]) 
test_data.extend(x_train_groups[K])   

print(np.asarray(train_data).shape) #Output: (598,)
print(np.asarray(test_data).shape) #Output: (299,)

I'm trying to implement k-fold cross-validation, so I wrote code that merges the groups into train and test data. The problem is that when I call np.asarray on the result, it returns an array whose shape differs from that of the original data. The code above includes the printed output for reference.


Solution

  • Let's make a small demo csr matrix (the session assumes import numpy as np and from scipy import sparse):

    In [212]: M = (sparse.random(12,3,.5, 'csr')*10).astype(int)                    
    In [213]: M                                                                     
    Out[213]: 
    <12x3 sparse matrix of type '<class 'numpy.int64'>'
        with 18 stored elements in Compressed Sparse Row format>
    In [214]: M.A                                                                   
    Out[214]: 
    array([[3, 1, 3],
           [0, 0, 1],
           [1, 0, 9],
           [0, 6, 0],
           [5, 4, 0],
           [4, 5, 6],
           [3, 0, 0],
           [0, 0, 5],
           [0, 0, 2],
           [0, 1, 0],
           [0, 0, 0],
           [0, 9, 0]])
    

    Your grouping produces a list of small csr matrices:

    In [216]: alist = [M[i::3] for i in range(3)]                                   
    In [217]: alist                                                                 
    Out[217]: 
    [<4x3 sparse matrix of type '<class 'numpy.int64'>'
        with 7 stored elements in Compressed Sparse Row format>,
     <4x3 sparse matrix of type '<class 'numpy.int64'>'
        with 4 stored elements in Compressed Sparse Row format>,
     <4x3 sparse matrix of type '<class 'numpy.int64'>'
        with 7 stored elements in Compressed Sparse Row format>]
    

    Now look at what extend does with a single group (your K case):

    In [218]: data = []                                                             
    In [219]: data.extend(alist[2])                                                 
    In [220]: data                                                                  
    Out[220]: 
    [<1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 2 stored elements in Compressed Sparse Row format>,
     <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>,
     <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 1 stored elements in Compressed Sparse Row format>,
     <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 1 stored elements in Compressed Sparse Row format>]
    

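    list.extend adds the elements of an iterable to the list one by one (in a 'flat' sense). Iterating over a sparse matrix (here alist[2]) yields its rows as separate 1-row sparse matrices (each still 2d), so data ends up holding four 1x3 matrices rather than one 4x3 matrix.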

    We can join them using sparse.vstack:

    In [221]: sparse.vstack(data)                                                   
    Out[221]: 
    <4x3 sparse matrix of type '<class 'numpy.int64'>'
        with 7 stored elements in Compressed Sparse Row format>
    In [222]: sparse.vstack(data).A                                                 
    Out[222]: 
    array([[1, 0, 9],
           [4, 5, 6],
           [0, 0, 2],
           [0, 9, 0]])
    

    which is just the same as the source submatrix:

    In [223]: alist[2]                                                              
    Out[223]: 
    <4x3 sparse matrix of type '<class 'numpy.int64'>'
        with 7 stored elements in Compressed Sparse Row format>
    In [224]: alist[2].A                                                            
    Out[224]: 
    array([[1, 0, 9],
           [4, 5, 6],
           [0, 0, 2],
           [0, 9, 0]])
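
Applied to the question, this suggests skipping the row-by-row list entirely and stacking whole groups. The following is only a minimal sketch: it assumes a small random matrix in place of X_train_bow and three groups as in the demo, and the names M, groups, train_data and test_data are illustrative rather than taken from the original code.

```python
from scipy import sparse

# Hypothetical stand-in for X_train_bow: a small random csr matrix.
M = sparse.random(12, 3, density=0.5, format="csr", random_state=0)

n_groups = 3
K = 2  # index of the held-out group
groups = [M[i::n_groups] for i in range(n_groups)]

# Stack whole groups with sparse.vstack instead of extending a list row by row.
train_data = sparse.vstack(
    [g for j, g in enumerate(groups) if j != K], format="csr"
)
test_data = groups[K]  # already a 2d csr matrix, nothing to merge

print(train_data.shape)  # (8, 3)
print(test_data.shape)   # (4, 3)
```

Both results stay sparse; calling .toarray() on them is only needed if a dense array is really required downstream.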
    

    Putting that data list in an array just makes a 1d object dtype array of 1-row sparse matrices, which is why np.asarray(train_data).shape in the question came out as (598,). The matrices are just foreign objects to np.array. As a general rule, don't count on numpy functions doing the 'right' thing with sparse matrices.

    In [225]: np.array(data)                                                        
    Out[225]: 
    array([<1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 2 stored elements in Compressed Sparse Row format>,
           <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>,
           <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 1 stored elements in Compressed Sparse Row format>,
           <1x3 sparse matrix of type '<class 'numpy.int64'>'
        with 1 stored elements in Compressed Sparse Row format>], dtype=object)
    

    Don't just look at shapes. Check the dtype, and examine some elements!
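
A quick way to run that check is sketched below. It assumes a random demo matrix, and it builds the list of 1-row matrices by slicing, as a stand-in for the list that extend() produces; the names rows, arr and dense are illustrative.

```python
import numpy as np
from scipy import sparse

# Hypothetical demo matrix, not the data from the question.
M = sparse.random(12, 3, density=0.5, format="csr", random_state=0)

# A list of 1-row csr matrices, the same kind of list that extend() builds.
rows = [M[i:i + 1] for i in range(2, 12, 3)]

arr = np.asarray(rows)
print(arr.shape, arr.dtype)  # (4,) object
print(type(arr[0]))          # each element is still a sparse matrix

# Stack first, then densify, to get a real 2d numeric array.
dense = sparse.vstack(rows).toarray()
print(dense.shape, dense.dtype)  # (4, 3) float64
```

The object dtype is the giveaway: numpy stored references to four sparse matrices and never saw the numbers inside them.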