Search code examples
pythonarraysjoinscipysparse-matrix

How to build a string out of sparse array efficiently in python


I am using scipy in python for sparse arrays/matrices.

I have a sparse array [generated out of dot product between other two]. The size of this array should not be big (can be hundreds) but this piece of code is being called many many times. So it must be very efficient.

What i need is an efficient way to create a string of 1 and 0 as follows:

    nonzeros = np.nonzero(hashcode)
    arr = ['0']*hashcode_length
    if nonzeros != None:
        for i in nonzeros[1]:
            arr[i] = '1'
    concatenated = ''.join(arr)

as an example... if the sparse array is of length 10 and the values are:

    [0, 0, 1, 1, 0, 1, 0, 0, 0, 1]

The output should be:

    "0011010001"

How can this code be improved?

Note:

  • sorry for not putting all the code, since it is big and has too much not-relevant details
  • Let me know if the question is not clear or some more details are needed

Solution

  • My first idea is this:

    In [1380]: x=sparse.coo_matrix([0,0,1,1,0,1,0,0,0,1])
    In [1381]: ''.join([str(i) for i in x.A.ravel().tolist()])
    Out[1381]: '0011010001'
    

    Not necessarily better, but I think it illustrates some key issues. Is it useful to work with the sparse matrix or dense? How do you convert integers to strings? Is this a 1 row matix?

    I can improve the string conversion with astype:

    In [1393]: ''.join(x.A.ravel().astype('U1'))
    Out[1393]: '0011010001'
    

    join is performing a list iteration on the array.

    With bytestrings (PY3, or normal string in PY2), tostring is an alternative to join. This just returns the databuffer as a string:

    In [1451]: x.A.astype('S1').tostring()
    Out[1451]: b'0011010001'
    

    I can use the astype on the sparse matrix, but there's a bug preventing me from making that dense:

    In [1397]: x.astype('U1').A
    ...
    ValueError: unsupported data types in input
    

    ===========================

    A variation on your iteration; start with a 0 string; make it a list; sparse.nonzero just returns the .col values from the coo format matrix.

    In [1403]: ll    # ll = '0'*x.shape[1]
    Out[1403]: '0000000000'
    In [1404]: ll=list(ll)
    In [1405]: ll
    Out[1405]: ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0']
    In [1406]: for i in x.col:
          ...:     ll[i]='1'
          ...:     
    In [1407]: ll
    Out[1407]: ['0', '0', '1', '1', '0', '1', '0', '0', '0', '1']
    In [1408]: ''.join(ll)
    Out[1408]: '0011010001'
    

    Or doing the same thing with a string array:

    In [1416]: ll=np.empty(x.shape[1], dtype='U1')
    In [1417]: ll.fill('0')
    In [1418]: ll
    Out[1418]: 
    array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '0'], 
          dtype='<U1')
    In [1419]: ll[x.col]='1'
    In [1420]: ll
    Out[1420]: 
    array(['0', '0', '1', '1', '0', '1', '0', '0', '0', '1'], 
          dtype='<U1')
    

    This avoids the loop(s) since I can assign multiple values at once.

    For this small example, the list solutions might be just as fast or faster. Array versions have some array creation overhead, so they are best if the case is large.

    Even for a coo matrix with (1,1210) shape, this list iterative version is noticeably faster:

    def foo1(x):
        ll=list('0'*x.shape[1])   # ll=['0']*x.shape[1] is little faster
        for i in x.col:
            ll[i]='1'
        return ''.join(ll)
    

    If the matrix is not coo, either convert it, x.tocoo() or use x.nonzero() (but look at its code).

    =========

    I've ignored your None test. Why is that there? It could be dangerous

    In [1448]: x.nonzero()[1] != None
    /usr/local/bin/ipython3:1: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
      #!/usr/bin/python3
    Out[1448]: True