I am using scipy in python for sparse arrays/matrices.
I have a sparse array [generated out of dot product between other two]. The size of this array should not be big (can be hundreds) but this piece of code is being called many many times. So it must be very efficient.
What i need is an efficient way to create a string of 1 and 0 as follows:
nonzeros = np.nonzero(hashcode)
arr = ['0']*hashcode_length
if nonzeros != None:
for i in nonzeros[1]:
arr[i] = '1'
concatenated = ''.join(arr)
as an example... if the sparse array is of length 10 and the values are:
[0, 0, 1, 1, 0, 1, 0, 0, 0, 1]
The output should be:
"0011010001"
How can this code be improved?
Note:
My first idea is this:
In [1380]: x=sparse.coo_matrix([0,0,1,1,0,1,0,0,0,1])
In [1381]: ''.join([str(i) for i in x.A.ravel().tolist()])
Out[1381]: '0011010001'
Not necessarily better, but I think it illustrates some key issues. Is it useful to work with the sparse matrix or dense? How do you convert integers to strings? Is this a 1 row matix?
I can improve the string conversion with astype
:
In [1393]: ''.join(x.A.ravel().astype('U1'))
Out[1393]: '0011010001'
join
is performing a list iteration on the array.
With bytestrings (PY3, or normal string in PY2), tostring
is an alternative to join
. This just returns the databuffer as a string:
In [1451]: x.A.astype('S1').tostring()
Out[1451]: b'0011010001'
I can use the astype
on the sparse matrix, but there's a bug preventing me from making that dense:
In [1397]: x.astype('U1').A
...
ValueError: unsupported data types in input
===========================
A variation on your iteration; start with a 0
string; make it a list; sparse.nonzero
just returns the .col
values from the coo
format matrix.
In [1403]: ll # ll = '0'*x.shape[1]
Out[1403]: '0000000000'
In [1404]: ll=list(ll)
In [1405]: ll
Out[1405]: ['0', '0', '0', '0', '0', '0', '0', '0', '0', '0']
In [1406]: for i in x.col:
...: ll[i]='1'
...:
In [1407]: ll
Out[1407]: ['0', '0', '1', '1', '0', '1', '0', '0', '0', '1']
In [1408]: ''.join(ll)
Out[1408]: '0011010001'
Or doing the same thing with a string array:
In [1416]: ll=np.empty(x.shape[1], dtype='U1')
In [1417]: ll.fill('0')
In [1418]: ll
Out[1418]:
array(['0', '0', '0', '0', '0', '0', '0', '0', '0', '0'],
dtype='<U1')
In [1419]: ll[x.col]='1'
In [1420]: ll
Out[1420]:
array(['0', '0', '1', '1', '0', '1', '0', '0', '0', '1'],
dtype='<U1')
This avoids the loop(s) since I can assign multiple values at once.
For this small example, the list solutions might be just as fast or faster. Array versions have some array creation overhead, so they are best if the case is large.
Even for a coo
matrix with (1,1210) shape, this list iterative version is noticeably faster:
def foo1(x):
ll=list('0'*x.shape[1]) # ll=['0']*x.shape[1] is little faster
for i in x.col:
ll[i]='1'
return ''.join(ll)
If the matrix is not coo
, either convert it, x.tocoo()
or use x.nonzero()
(but look at its code).
=========
I've ignored your None
test. Why is that there? It could be dangerous
In [1448]: x.nonzero()[1] != None
/usr/local/bin/ipython3:1: FutureWarning: comparison to `None` will result in an elementwise object comparison in the future.
#!/usr/bin/python3
Out[1448]: True