Search code examples
c++data-structurescontainerssparse-matrix

What type of sparse vector should I use?


Data

I have N different (sorted) vectors of indices (std::vector<unsigned int>). The indices are in the range [0; L-1]. Here are two rules of thumbs about this data:

  • Only about 0.1% to 10% of possible index are present anywhere
  • If an index is found in a given vector, then it will likely be found multiple times again in other vectors.

Hence a possible data set with N=10 vectors and with L = 200 could be

{45, 110, 119, 145, 170}
{9, 45, 110, 145, 178, 170}
{45, 145}
{45, 178, 183}
{45, 53, 110, 170}
{9, 119, 123, 179}
{9, 45, 119, 130, 131, 170, 190, 199}
{9, 45, 110, 170, 199}
{31, 45, 145}
{9, 178, 183}

Goal

I would like to compute the frequencies of every index. I would do something like

std::vector<double> computeFrequencies(std::vector<std::vector<unsigned int>>& data)
{
    assert(data.size() == N);

    std::vector<double> frequencies(L);
    for (unsigned Ni = 0 ; Ni < N ; Ni++)
    {
        for (unsigned i = 0 ; i < data[Ni].size() ; i++)
        {
            assert(data[Ni][i] < L)
            frequencies[data[Ni][i]]++;
        }
    }

    for (unsigned i = 0 ; i < L; i++)
    {
        frequencies[i] /= (double) N;
    }

    return(frequencies);    
}

I will then loop again through the object returned by the function computeFrequencies only once.

for (unsigned i = 0 ; i < L; i++)
{
    foo(frequencies[i]);
}

Question

The object frequencies contains a lot fo zeros and I should hence be using a sparse vector instead. I don't have much understanding of sparse matrices though. What type of sparse vector should I use?

I am considering using boost::numeric::ublas::coordinate_matrix<double><double> because as I loop through all N vectors, I would constantly be adding new non-zeros values and I think a coordinate matrix would be good for dealing with that. Note that generally speaking, for this function, I am more worried about RAM usage than about computational time.


Solution

  • It doesn't look like a sparse vector representation is a good fit for your problem.

    To accomplish your task as you describe it:

    1. Merge your sorted vectors into a single sorted vector. How to do an efficient K-way merge pops up here every now and then: merging N sorted files using K way merge
    2. Iterate through the new vector and count the number of duplicates of each entry (easy since they'll all be together) to get your frequencies and foo them as you go.

    You can even do both steps at the same time, entirely avoiding the need to copy the data into a new structure.