tensorflow, nlp, word-embedding

Universal Sentence Encoder embeddings for digits are very similar


I have a sentence-similarity task where I calculate the cosine similarity of two sentence embeddings to decide how similar the sentences are. It seems that for sentences containing digits, the similarity is not affected no matter how "far apart" the numbers are. For example:

a = generate_embedding('issue 845')
b = generate_embedding('issue 11')
cosine_sim(a, b) = 0.9307
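
(For reference, a minimal sketch of this setup is below, assuming the Universal Sentence Encoder from TF Hub; the exact model version and the bodies of generate_embedding / cosine_sim are guesses, only the names come from the snippet above.)

import numpy as np
import tensorflow_hub as hub

# Assumed model: Universal Sentence Encoder v4 from TF Hub.
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def generate_embedding(sentence):
    # The model takes a batch of strings and returns a [batch, 512] tensor.
    return use_model([sentence])[0].numpy()

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = generate_embedding('issue 845')
b = generate_embedding('issue 11')
print(cosine_sim(a, b))  # ~0.93 regardless of how far apart the numbers are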

Is there a way to make the embeddings of different numbers more distant from each other, or is there any other hack to handle this issue?


Solution

  • If your sentence embeddings are produced from the embeddings of individual words (or tokens), then one hack could be the following:

    Add extra dimensions to the word embeddings. These dimensions would be set to zero for all non-numeric tokens, while for numeric tokens they would hold values reflecting the magnitude of the number. It gets a bit mathematical, because cosine similarity works with angles, so the extra dimensions would have to translate differences in magnitude into larger or smaller angles between the embeddings. A rough sketch of this idea is given after this list.

    An easier (workaround) hack would be to extract the numeric values from the sentences with regular expressions, compute their distance, and combine that information with the similarity score to obtain a new, adjusted similarity score; see the second sketch below.
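
    A minimal sketch of the extra-dimension idea, assuming the sentence embedding is the mean of per-token embeddings; token_embedding and NUM_SCALE are illustrative placeholders, not part of any library:

    import re
    import numpy as np

    NUM_SCALE = 0.1  # assumed weight; controls how strongly the numeric dimension bends the angle

    def sentence_embedding_with_magnitude(tokens, token_embedding):
        # token_embedding is a placeholder for whatever produces per-token vectors.
        vectors = []
        for tok in tokens:
            base = np.asarray(token_embedding(tok))
            if re.fullmatch(r"\d+(?:\.\d+)?", tok):
                # Numeric token: extra dimension reflects the (log-scaled) magnitude.
                extra = NUM_SCALE * np.log1p(float(tok))
            else:
                # Non-numeric token: extra dimension stays zero.
                extra = 0.0
            vectors.append(np.append(base, extra))
        return np.mean(vectors, axis=0)  # mean-pooled sentence embedding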
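
    And a sketch of the simpler regex workaround; the combination rule and the alpha weight are just one possible choice, not a standard formula:

    import re

    def numeric_distance(s1, s2):
        # Relative gap between the first numbers found in each sentence, in [0, 1];
        # 0.0 if either sentence contains no number.
        n1 = re.findall(r"\d+(?:\.\d+)?", s1)
        n2 = re.findall(r"\d+(?:\.\d+)?", s2)
        if not n1 or not n2:
            return 0.0
        x, y = float(n1[0]), float(n2[0])
        return abs(x - y) / max(abs(x), abs(y), 1.0)

    def adjusted_similarity(s1, s2, embedding_sim, alpha=0.5):
        # Down-weight the embedding similarity by the numeric gap.
        return embedding_sim * (1.0 - alpha * numeric_distance(s1, s2))

    # Example from the question: embedding similarity 0.9307 gets penalised
    print(adjusted_similarity('issue 845', 'issue 11', 0.9307))  # ~0.47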