python, scipy, scikit-learn, sparse-matrix

Stacking two sparse matrices with different dimensions


I have two sparse matrices, created with sklearn's HashingVectorizer from two sets of features (each set corresponds to a feature). I want to concatenate them so I can use them later for clustering. However, I am running into a problem with dimensions, because the two matrices do not have the same number of rows.

Here is an example:

Xa = [-0.57735027 -0.57735027  0.57735027 -0.57735027 -0.57735027  0.57735027
  0.5         0.5        -0.5         0.5         0.5        -0.5         0.5
  0.5        -0.5         0.5        -0.5         0.5         0.5        -0.5
  0.5         0.5       ]

Xb = [-0.57735027 -0.57735027  0.57735027 -0.57735027  0.57735027  0.57735027
 -0.5         0.5         0.5         0.5        -0.5        -0.5         0.5
 -0.5        -0.5        -0.5         0.5         0.5       ]

Both Xa and Xb are of type <class 'scipy.sparse.csr.csr_matrix'>. Their shapes are Xa.shape = (6, 1048576) and Xb.shape = (5, 1048576). This is the error I get (I now understand why it happens):

    X = hstack((Xa, Xb))
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 464, in hstack
    return bmat([blocks], format=format, dtype=dtype)
  File "/usr/local/lib/python2.7/site-packages/scipy/sparse/construct.py", line 581, in bmat
    'row dimensions' % i)
ValueError: blocks[0,:] has incompatible row dimensions

Is there a way to stack the sparse matrices despite their mismatched row dimensions? Maybe with some padding?

I have looked into these posts:


Solution

  • You can pad the smaller matrix with an empty sparse matrix.

    To stack the matrices horizontally, the smaller matrix must first be padded so that it has the same number of rows as the larger one. To do that, vertically stack it with an empty matrix of shape (difference in number of rows, number of columns of the smaller matrix).

    Like this:

    from scipy.sparse import csr_matrix
    from scipy.sparse import hstack
    from scipy.sparse import vstack
    
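    # Create two empty sparse matrices for the demo
    Xa = csr_matrix((4, 4))
    Xb = csr_matrix((3, 5))

    # Number of rows Xb is missing relative to Xa
    # (this demo assumes Xa has at least as many rows as Xb)
    diff_n_rows = Xa.shape[0] - Xb.shape[0]

    # Pad Xb with empty rows so that it has as many rows as Xa
    Xb_new = vstack((Xb, csr_matrix((diff_n_rows, Xb.shape[1]))))

    # Both blocks now have 4 rows, so they can be stacked side by side
    X = hstack((Xa, Xb_new))
    X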
    

    Which results in:

    <4x9 sparse matrix of type '<class 'numpy.float64'>'
        with 0 stored elements in COOrdinate format>
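
The snippet above computes the row difference as Xa minus Xb, so it only works when Xa is the taller matrix. A minimal sketch of a more general version, which pads whichever matrix has fewer rows, might look like the following. The helper names `pad_rows` and `hstack_padded` are hypothetical and not part of SciPy; only `csr_matrix`, `hstack` and `vstack` come from `scipy.sparse`.

```python
from scipy.sparse import csr_matrix, hstack, vstack


def pad_rows(matrix, n_rows):
    """Return `matrix` with all-zero rows appended until it has `n_rows` rows.

    Hypothetical helper for illustration; returns the input unchanged
    if it already has at least `n_rows` rows.
    """
    missing = n_rows - matrix.shape[0]
    if missing <= 0:
        return matrix
    # An empty sparse block with the same number of columns stores no values.
    padding = csr_matrix((missing, matrix.shape[1]), dtype=matrix.dtype)
    return vstack((matrix, padding), format="csr")


def hstack_padded(a, b):
    """Horizontally stack two sparse matrices, padding the shorter one with empty rows."""
    n_rows = max(a.shape[0], b.shape[0])
    return hstack((pad_rows(a, n_rows), pad_rows(b, n_rows)), format="csr")


# Small demo matrices with different row counts
Xa = csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])  # shape (3, 2)
Xb = csr_matrix([[0.0, 4.0, 0.0]])                     # shape (1, 3)

X = hstack_padded(Xa, Xb)
print(X.shape)
print(X.toarray())
```

With the shapes from the question, (6, 1048576) and (5, 1048576), the same call would give a matrix of shape (6, 2097152). Note that the padded rows are all zeros, so the extra row carries no information from Xb.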