EDIT: I have reformulated my question into something more productive, and will provide an answer below. The old version of this question is below still.
I am implementing an optimized SpMV kernel for the CSR format in RVV 0.7.1. In C, SpMV can be implemented as such:
for (int i = 0; i < numRows; i++)
for (int j = rowPtr[i]; j < rowPtr[i+1]; j++)
y[i] += val[j] * x[colIdx[j]];
Vectorizing this is simple initially, by loading coldIdx
, using it to perform an indexed load of x
, loading val
and multiplying the two. After this, a reduction sum (v{f}redsum.vs
) must be performed in each row, adding all the products within them and storing them in the respective index of y
. The challenge is creating the mask from rowPtr
, so that only the elements belonging to a given row are active for each reduction.
My initial thought (which should provide context regarding the original question) was using the indexes in rowPtr
to place 1's in the respective elements of a vector register, then using viota
and an element shift to create a register containing the row each element belongs to. However, that is apparently not viable.
How do I build the required masks?
Old question
I am trying to use the RISC-V vector extension (0.7.1) to create a mask register based off an array of indices. For example, take the following array of indices, idx array
. The goal is to use the indices present in it to put 1's in the respective mask elements, as you see with the mask
below:
idx array | 0 2 3 4 4 6 8
elem idx | 0 1 2 3 4 5 6 7 8
mask | 1 0 1 1 1 0 1 0 1
At first I was mistakenly looking at the vrgather.vx
instruction, but it does not do quite what I want, as it populates all elements. I would like the code to be quite performant as well, so if it's possible for it consist of mostly vector instructions that would be ideal. I would greatly appreciate it if someone could point me in the right direction.
Credit for the solution goes to camel-cdr!
Take the following rowPtr
array:
0 2 3 5 7
Lets consider the third iteration ([3,5[
), which should be more illustrative. The goal is to arrive at the following mask, enabling elements 3 and 4:
00011000
The first step is using vid
to write each element's index to a vector. With that, we can use vmseq
(set mask if equal) twice, with elements i
and i+1
of rowPtr
, which will result in two registers like so, with a one in element rowPtr[i]
or rowPtr[i+1]
:
idx | 01234567
i | 00010000
i+1 | 00000100
Then we can use vmsbf.m
(set-before-first) on both, which sets all bits before the first active mask bit. By XOR'ing the two, we arrive at the final desired mask:
sbf i | 11100000
sbf i+1 | 11111000
xor | 00011000
By iterating through rowPtr
and repeating this, it is possible to generate a mask for the elements of each individual row. In short, we can write this operation as:
vmxor(vmsbf(vmseq(vid,rowPtr[i])), vmsbf(vmseq(vid,rowPtr[i+1])))