Search code examples
javarandommontecarlosemantic-analysis

How to implement random sampling of a set of vectors in java?


I have a huge number of context vectors and I want to find the average cosine similarity of them. However, it's not efficient to calculate it through the whole set. That's why, I want to take a random sample from this set.

The problem is that each context vector explains a degree of the meaning for a word so I want to make a balanced selection(according to vector values). I searched and found that I can use Monte Carlo method. I also found a Gibbs Sampler example here: https://darrenjw.wordpress.com/2011/07/16/gibbs-sampler-in-various-languages-revisited/

However, I confused a little bit. As I understand, the method provides a normal distribution and generates double numbers. I did not understand how to implement this method in my case. Could somebody explain me how can I solve this problem?

Thanks in advance.


Solution

  • You don't want a random sample, you want a representative sample. One relatively efficient way to do this is to sort your elements in "strength" order, then take every nth element, which will give you a representative sample of size/n elements.

    Try this:

    // Given
    Set<Vector> mySet;
    int reductionFactor = 200; // eg sample 0.5% of elements
    
    List<Vector> list = new ArrayList<>(mySet);
    Collections.sort(list, new Comparator<Vector> {
        public int compare(Vector o1, Vector o2) {
            // however you compare "strength"
        }         
    });
    List<Vector> randomSample = new ArrayList<>(list.size() / reductionFactor );
    for (int i = 0; i < list.size(); i += reductionFactor)
        randomSample.add(list.get(i);
    

    The time complexity is O(n log n) due to the sort operation, and space complexity is O(n).