I have seen in many Kaggle kernels and tutorials that averaging word embeddings is used to get the embedding of a sentence. But I am wondering whether this is a correct approach, since it discards the positional information of the words in the sentence. Is there a better way to combine embeddings, maybe by combining them hierarchically in a particular way?
If you need a simple yet effective approach, SIF embedding works perfectly fine. It averages the word vectors in a sentence and then removes the projection onto the first principal component, which makes it noticeably better than plain averaging. The code is available online here. Here is the main part:
from sklearn.decomposition import TruncatedSVD

# Fit a rank-1 truncated SVD to find the first principal component of the sentence matrix
# (rand_seed is any fixed integer seed)
svd = TruncatedSVD(n_components=1, random_state=rand_seed, n_iter=20)
svd.fit(all_vector_representation)
pc = svd.components_
# Subtract each sentence's projection onto that first principal component
XX2 = all_vector_representation - all_vector_representation.dot(pc.transpose()) * pc
where all_vector_representation is a matrix holding the averaged word embeddings of each sentence in your dataset (one row per sentence).
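For context, here is a minimal sketch of how all_vector_representation could be built, assuming sentences is a list of tokenized sentences and word_vectors is a pre-trained embedding lookup (both names are placeholders, and the embedding dimension of 300 is just an example):

import numpy as np

# Hypothetical inputs: `sentences` is a list of token lists,
# `word_vectors` maps a token to its pre-trained embedding vector.
dim = 300
all_vector_representation = np.zeros((len(sentences), dim))
for i, tokens in enumerate(sentences):
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if vecs:  # average the word vectors of this sentence
        all_vector_representation[i] = np.mean(vecs, axis=0)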
Other, more sophisticated approaches also exist, such as ELMo and Transformer-based models.
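If you want to try a Transformer-based option, one common route is the sentence-transformers library; this is just a sketch, assuming the library is installed and using the "all-MiniLM-L6-v2" checkpoint as one example model:

# Minimal sketch (assumes `pip install sentence-transformers`);
# the model name is only one commonly used example checkpoint.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentence_embeddings = model.encode(["This is a sentence.", "This is another one."])
print(sentence_embeddings.shape)  # (2, embedding_dim)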