Assuming that I have a word similarity score for each pair of words in two sentences, what is a decent approach to determining the overall sentence similarity from those scores?
The word scores are calculated using cosine similarity from vectors representing each word.
Now that I have individual word scores, is it too naive to sum the individual word scores and divide by the total word count of both sentences to get a score for the two sentences?
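For concreteness, here is a minimal sketch of one reading of that naive approach, assuming each sentence is a NumPy array with one row per word vector and that every word in one sentence is scored against every word in the other. The function names `pairwise_cosine` and `naive_sentence_score` are hypothetical, made up for this illustration.

```python
import numpy as np

def pairwise_cosine(a, b):
    """Cosine similarity between every row of `a` and every row of `b`.

    `a` has shape (m, d) and `b` has shape (n, d); the result has shape (m, n).
    Assumes no word vector is all zeros.
    """
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_unit @ b_unit.T

def naive_sentence_score(a, b):
    """Sum all word-pair scores and divide by the combined word count."""
    scores = pairwise_cosine(a, b)
    return scores.sum() / (a.shape[0] + b.shape[0])
```

Under this reading the sum has m * n terms but is divided by m + n, so the result is not a true average: two identical sentences do not necessarily score 1, and the value depends on sentence length.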
I've read about going a step further: constructing vectors to represent the sentences, using the word scores, and then applying cosine similarity again to compare the sentences. But I'm not familiar with how to construct sentence vectors from the existing word scores, nor am I aware of the tradeoffs compared with the naive approach described above, which at the very least I can easily comprehend. :)
Any insights are greatly appreciated.
Thanks.
What I ended up doing was taking the mean of each sentence's set of word vectors and then applying cosine similarity to the two means, which gives a single score for the pair of sentences.
I'm not sure how mathematically sound this approach is, but I've seen it done in other places (like Python's gensim).
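Roughly, the mean-vector approach looks like the sketch below. It assumes each sentence is a NumPy array with one row per word vector; the name `mean_vector_similarity` is hypothetical and this is not gensim's own code, just an illustration of the same idea.

```python
import numpy as np

def mean_vector_similarity(a, b):
    """Cosine similarity between the mean word vectors of two sentences.

    `a` has shape (m, d) and `b` has shape (n, d).
    Assumes neither mean vector is all zeros.
    """
    mean_a = a.mean(axis=0)  # one d-dimensional vector per sentence
    mean_b = b.mean(axis=0)
    cosine = mean_a @ mean_b / (np.linalg.norm(mean_a) * np.linalg.norm(mean_b))
    return float(cosine)
```

Because the word vectors are averaged first, word order is ignored and the score always falls between -1 and 1, regardless of how many words each sentence has.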