The paper Learning Deep Structured Semantic Models for Web Search using Clickthrough Data uses a word hashing technique to convert the one-hot representation of a word into a (sparse) vector of letter trigrams.
From my understanding, a word such as look
is first decomposed into the letter trigrams [#lo, loo, ook, ok#]
and is then represented as a vector with ones at the positions of these trigrams and zeros elsewhere. This reduces the dimension of the word vector while causing very few collisions, as reported in the paper.
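To make my understanding concrete, here is a minimal Python sketch of the decomposition and the resulting sparse trigram count vector (the function names letter_trigrams and hash_vector and the tiny trigram vocabulary are my own for illustration, not from the paper):

```python
from collections import Counter

def letter_trigrams(word):
    """Decompose a word into letter trigrams after adding boundary marks '#'."""
    marked = "#" + word + "#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]

print(letter_trigrams("look"))  # ['#lo', 'loo', 'ook', 'ok#']

def hash_vector(word, trigram_index):
    """Sparse word-hashing vector: trigram counts over a fixed trigram vocabulary."""
    counts = Counter(letter_trigrams(word))
    return [counts.get(t, 0) for t in trigram_index]

# Tiny illustrative trigram vocabulary; the paper's covers roughly 30k trigrams.
trigram_index = sorted({t for w in ("look", "looks", "book") for t in letter_trigrams(w)})
print(trigram_index)
print(hash_vector("look", trigram_index))
```

Running this shows, for instance, that look and looks share three trigrams (#lo, loo, ook), which is exactly the kind of overlap between different words that I am asking about.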
My confusion is this: normally, if we represent a document with a bag-of-words built on one-hot word vectors, we simply count the occurrences of each word. With a bag of letter trigrams, however, different words will easily share common trigrams, so it seems difficult to recover which words are actually in the document from such a representation.
Did I understand this correctly? How is this issue solved, or does it simply not matter for the query/title experiment in the paper?
With a bag of letter trigrams, however, different words will easily share common trigrams, so it seems difficult to recover which words are actually in the document from such a representation.
That's correct, and it is not a problem, because the model does not explicitly aim to learn the posterior probabilities from word-level information. Rather, it learns them directly from the trigram-level representation, so it never needs to recover which words were in the document.
How is this issue solved, or does it simply not matter for the query/title experiment in the paper?
This issue can be addressed by adding a CNN or LSTM layer that builds a higher-level (closer to word-level) abstraction from the trigram inputs. The research reported in this paper employs a CNN on top of the trigram inputs, as sketched below.
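For intuition only, here is a minimal PyTorch sketch of that idea. It is not the exact architecture of either paper; the class name ConvSemanticEncoder and all layer sizes are assumptions. The convolution slides over the sequence of per-word trigram vectors, so the learned features capture local word context rather than a single document-wide bag of trigrams:

```python
import torch
import torch.nn as nn

TRIGRAM_DIM = 30000   # letter-trigram vocabulary size (about 30k in the paper)
CONV_DIM = 300        # convolutional feature dimension (assumed)
SEM_DIM = 128         # final semantic vector dimension (assumed)

class ConvSemanticEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # kernel_size=3 lets each feature look at a 3-word window of trigram vectors
        self.conv = nn.Conv1d(TRIGRAM_DIM, CONV_DIM, kernel_size=3, padding=1)
        self.proj = nn.Linear(CONV_DIM, SEM_DIM)

    def forward(self, x):
        # x: (batch, seq_len, TRIGRAM_DIM), one trigram count vector per word
        h = torch.tanh(self.conv(x.transpose(1, 2)))  # (batch, CONV_DIM, seq_len)
        h = h.max(dim=2).values                       # max pooling over word positions
        return torch.tanh(self.proj(h))               # fixed-size semantic vector

# Usage: encode a query of 5 words, each given as its trigram count vector.
query = torch.zeros(1, 5, TRIGRAM_DIM)
print(ConvSemanticEncoder()(query).shape)  # torch.Size([1, 128])
```

Max pooling over positions keeps the strongest local feature in each dimension, which is the usual way such convolutional encoders collapse variable-length input into a fixed-size semantic vector.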