I'm trying to wrap my head around the most efficient way to deal with arrays of indeterminate size as outputs of RS kernels. I would send the index of the last relevant array slot in the out allocation, but I learned in the answer to my previous question, there's not a good way to pass a global back to java after kernel execution. I've decided to "zoom out" the process again which lead me to the pattern below.
For example let's say we have an input allocation containing a struct (or structs) that that contains two arrays of polar coordinates; something like set_pair from bellow:
typedef struct polar_tag{
uint8_t angle;
uint32_t mag;
} polar;
typedef struct polar_set_tag{
uint8_t filled_slots;
polar coordinates[60];
} polar_set;
typedef struct set_pair_tag{
polar_set probe_set;
polar_set candidate_set;
} set_pair;
We want to find similar coordinate pairs between the sets so we set up a kernel to decide which (if any) of the polar coordinates are similar. If they're similar we load it into an output allocation that looks something like "matching_set":
typedef struct matching_pair_tag{
uint8_t probe_index;
uint8_t candidate_index;
} matching_pair;
typedef struct matching_set_tag{
matching_pair pairs[120];
uint8_t filled_slots;
} matching_set;
Is creating allocations with instructions like "filled_slots" the most efficient (or only) way to handle this sort of indeterminate I/O with RS or is there a better way?
I think the way I would try to approach this is to do a two pass.
For the 0-2 case:
Setup: for each coordinate, allocate an array to hold the max expected number of pairs (2).
Pass 1: run over coords, look for pairs by comparing the current item to a subset of other coords. Choose subset to avoid duplicate answers when the kernel runs on the other coord being compared.
Pass 2: Merge the results from #1 back into a list or whatever other data structure you want. Could run as an invokable if the number of coordinates is small.
For the 0-N case:
This gets a lot harder. I'd likely do something similar to what's above but with the per-coord array sized for a typical number of pairs. For the (hopefully small) number of overflows, use atomics to reserve a slot in an overflow buffer. The catch here is I think most GPU drivers would not be very happy with the atomics today. Would run very well on the CPU ref.
There are a lot of ways to go about this. One important decision point revolves around how expensive the comparison is to find the points vs the cost of writing the result.