python, scipy, sparse-matrix

Convert manually created DOK to CSR


For a machine learning task I need a sparse matrix in CSR format. As a first step I manually build a DOK (dictionary of keys), which looks like this (based on this guide):

dok = { (0,0): 1, (0,9): 1, (5,12): 1}
# the value is always 1
# the keys are the (row, column) positions in the matrix
# my DOK has around 6 million entries like these

I now want to convert this into CSR. If I understand the docs correctly, this is only possible if my input is also a sparse matrix. But my DOK is not recognised as a sparse matrix, just as a dictionary. I was also not able to cast my DOK to a "real" DOK (the following error occurred):

TypeError: Expected rank <=2 dense array or matrix.

So how can I convert my DOK to a CSR?


Solution

    In [472]: dok = { (0,0): 1, (0,9): 1, (5,12): 1}
    

    Make a blank dok matrix (this session assumes `from scipy import sparse`):

    In [473]: M = sparse.dok_matrix((20,20), dtype=int)                                                   
    In [474]: M                                                                                           
    Out[474]: 
    <20x20 sparse matrix of type '<class 'numpy.int64'>'
        with 0 stored elements in Dictionary Of Keys format>
    

    In the SciPy version used here, M is a subclass of the Python dictionary. It used to be possible to call the dictionary's .update method to add new values efficiently from a Python dictionary, but that method has been disabled (try it to see the error message). However, a backdoor has been provided:

    In [475]: M._update(dok)                                                                              
    In [476]: M                                                                                           
    Out[476]: 
    <20x20 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Dictionary Of Keys format>
    

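    _update carries a cautionary comment in its source: the values are not checked, so use it with caution. It is also a private method, so it may change or disappear between SciPy versions.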
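
If the lack of checking is a concern, one alternative is to fill the dok matrix one item at a time through normal indexing, which does validate each index. The sketch below assumes the same 20x20 shape and three-entry dictionary as the session above; it is likely to be much slower than a bulk update for millions of entries, so treat it as an illustration rather than a recommendation.

```python
from scipy import sparse

# Example dictionary from the question; the (20, 20) shape is an assumption.
dok = {(0, 0): 1, (0, 9): 1, (5, 12): 1}
shape = (20, 20)

M = sparse.dok_matrix(shape, dtype=int)

# Item assignment goes through the regular indexing code,
# so an index outside `shape` is rejected instead of being stored silently.
for (i, j), v in dok.items():
    M[i, j] = v

csr = M.tocsr()
print(repr(csr))
```

The trade-off is speed for safety: each assignment is checked individually.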

    Once you have dok format, you can convert it to csr format:

    In [477]: M1=M.tocsr()                                                                                
    In [478]: M1                                                                                          
    Out[478]: 
    <20x20 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>
    In [479]: M1.A                                                                                        
    Out[479]: 
    array([[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           ...
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
          dtype=int64)
    

    If you made an error in defining your dok, it will probably show up in the csr conversion.

    Another option is to iterate through your dok and construct the corresponding coo-style inputs (data, rows, cols), which csr_matrix accepts directly. That was the original input style, and it is well worth understanding and using.
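
The coo-style route can be sketched as follows. This is a minimal example under stated assumptions: the dictionary is the three-entry one from the question, the (20, 20) shape is chosen to match the session above, and the variable names are illustrative only. It relies on `csr_matrix` accepting a `(data, (rows, cols))` tuple, and it avoids any private methods.

```python
import numpy as np
from scipy import sparse

# Example dictionary from the question; the (20, 20) shape is an assumption.
dok = {(0, 0): 1, (0, 9): 1, (5, 12): 1}
shape = (20, 20)
n = len(dok)

# Split the (row, col) keys and the values into three parallel arrays.
# Iterating a dict yields its keys in the same order as .values().
rows = np.fromiter((i for i, _ in dok), dtype=np.int64, count=n)
cols = np.fromiter((j for _, j in dok), dtype=np.int64, count=n)
data = np.fromiter(dok.values(), dtype=np.int64, count=n)

# Build the CSR matrix directly from the coo-style triplet.
csr = sparse.csr_matrix((data, (rows, cols)), shape=shape)
print(repr(csr))
```

If the shape is omitted, it is inferred from the largest row and column index, so passing it explicitly is safer when trailing rows or columns may be empty.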