I have a huge number of context vectors and I want to find the average cosine similarity of them. However, it's not efficient to calculate it through the whole set. That's why, I want to take a random sample from this set.
The problem is that each context vector explains a degree of the meaning for a word so I want to make a balanced selection(according to vector values). I searched and found that I can use Monte Carlo method. I also found a Gibbs Sampler example here: https://darrenjw.wordpress.com/2011/07/16/gibbs-sampler-in-various-languages-revisited/
However, I confused a little bit. As I understand, the method provides a normal distribution and generates double numbers. I did not understand how to implement this method in my case. Could somebody explain me how can I solve this problem?
Thanks in advance.
You don't want a random sample, you want a representative sample. One relatively efficient way to do this is to sort your elements in "strength" order, then take every nth element, which will give you a representative sample of size/n elements.
Try this:
// Given
Set<Vector> mySet;
int reductionFactor = 200; // eg sample 0.5% of elements
List<Vector> list = new ArrayList<>(mySet);
Collections.sort(list, new Comparator<Vector> {
public int compare(Vector o1, Vector o2) {
// however you compare "strength"
}
});
List<Vector> randomSample = new ArrayList<>(list.size() / reductionFactor );
for (int i = 0; i < list.size(); i += reductionFactor)
randomSample.add(list.get(i);
The time complexity is O(n log n) due to the sort operation, and space complexity is O(n).