I have a list of values that I'm converting to a scipy.sparse.dok_matrix with a loop. I'm aware of numpy.bincount, but it doesn't work with sparse matrices. I'm wondering if there is a more efficient way to perform this conversion, because the construction time for a dok_matrix is really long.

The example below is for one row, but I'm scaling to a 2D matrix by looping. The number of times a value x appears in the input list becomes the value of the x-th element of the result matrix.
from scipy.sparse import dok_matrix

values = [1, 3, 3, 4]
expected_result = [0, 1, 0, 2, 1]

matrix = dok_matrix((1, MAXIMUM_EXPECTED_VALUE))
for value in values:
    matrix[0, value] = matrix.get((0, value)) + 1
MAXIMUM_EXPECTED_VALUE is on the order of 100000000 but len(values) < 100, which is why I'm using a sparse matrix. Possibly off-topic: there are also only a little over 10000 distinct values actually used within the range of MAXIMUM_EXPECTED_VALUE, but I think hashing to a contiguous range and converting back might be more complicated.
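For reference, the kind of contiguous-range mapping I had in mind would look roughly like this (an untested sketch using np.unique rather than an explicit hash; the dict at the end is just to show the mapping back to original values):

import numpy as np

values = np.array([1, 3, 3, 4])

# map each value to an index in the small contiguous range 0..k-1
unique_vals, inverse = np.unique(values, return_inverse=True)

# count occurrences in the compact range, then pair the counts
# back with the original values
counts = np.bincount(inverse)
print(dict(zip(unique_vals.tolist(), counts.tolist())))  # {1: 1, 3: 2, 4: 1}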
Looks like the standard coo style inputs suit your case:
In [142]: import numpy as np
In [143]: from scipy import sparse
In [144]: values = [1,3,3,4]
In [145]: col = np.array(values)
In [146]: row = np.zeros_like(col)
In [147]: data = np.ones_like(col)
In [148]: M = sparse.coo_matrix((data, (row,col)), shape=(1,10))
In [149]: M
Out[149]:
<1x10 sparse matrix of type '<class 'numpy.int64'>'
with 4 stored elements in COOrdinate format>
In [150]: M.A
Out[150]: array([[0, 1, 0, 2, 1, 0, 0, 0, 0, 0]])
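Duplicate (row, col) pairs are summed when the coo matrix is converted, which is exactly the counting behaviour you want. If I'm reading the 2D part of the question right, the same style extends to several rows without a Python-level loop over elements; a rough sketch with made-up per-row data:

import numpy as np
from scipy import sparse

# one list of values per output row (made-up example data)
rows_of_values = [[1, 3, 3, 4],
                  [0, 0, 7]]

# build flat row/col/data arrays; every occurrence contributes a 1
row = np.concatenate([np.full(len(v), i) for i, v in enumerate(rows_of_values)])
col = np.concatenate([np.asarray(v) for v in rows_of_values])
data = np.ones_like(col)

M = sparse.coo_matrix((data, (row, col)), shape=(len(rows_of_values), 10))
# duplicate (row, col) entries are summed on conversion, giving the counts
print(M.toarray())
# [[0 1 0 2 1 0 0 0 0 0]
#  [2 0 0 0 0 0 0 1 0 0]]

With shape=(n_rows, MAXIMUM_EXPECTED_VALUE) this avoids the per-element dok_matrix assignments entirely.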