java random montecarlo semantic-analysis

How to implement random sampling of a set of vectors in java?

I have a huge number of context vectors and I want to find the average cosine similarity of them. However, it's not efficient to calculate it through the whole set. That's why, I want to take a random sample from this set.

The problem is that each context vector explains a degree of the meaning for a word so I want to make a balanced selection(according to vector values). I searched and found that I can use Monte Carlo method. I also found a Gibbs Sampler example here: https://darrenjw.wordpress.com/2011/07/16/gibbs-sampler-in-various-languages-revisited/

However, I confused a little bit. As I understand, the method provides a normal distribution and generates double numbers. I did not understand how to implement this method in my case. Could somebody explain me how can I solve this problem?

Thanks in advance.

Solution

You don't want a random sample, you want a representative sample. One relatively efficient way to do this is to sort your elements in "strength" order, then take every nth element, which will give you a representative sample of size/n elements.

Try this:

// Given
Set<Vector> mySet;
int reductionFactor = 200; // eg sample 0.5% of elements

List<Vector> list = new ArrayList<>(mySet);
Collections.sort(list, new Comparator<Vector> {
    public int compare(Vector o1, Vector o2) {
        // however you compare "strength"
    }         
});
List<Vector> randomSample = new ArrayList<>(list.size() / reductionFactor );
for (int i = 0; i < list.size(); i += reductionFactor)
    randomSample.add(list.get(i);

The time complexity is O(n log n) due to the sort operation, and space complexity is O(n).