
Should data feed into Universal Sentence Encoder be normalized?


I am currently working with TensorFlow's Universal Sentence Encoder (https://arxiv.org/pdf/1803.11175.pdf) for my B.Sc. thesis, where I study extractive summarisation techniques. In the vast majority of techniques for this task (like https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11225/10855), the sentences are first normalized (lowercasing, stop word removal, lemmatisation), but I couldn't find a hint whether sentences fed into the USE should first be normalized. Is that the case? Does it matter?


Solution

  • The choice really depends on the application and its design.

    Regarding stop word removal and lemmatization: these operations remove content from the text, and with it, potentially useful information. However, if applying them makes no measurable impact on your task, you can use them. It is always best to try both variants; in general, the performance difference shouldn't be large.
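    To make the trade-off concrete, here is a minimal normalization sketch in plain Python. The stop word list and the `normalize` helper are illustrative only (real pipelines would use a fuller list, e.g. NLTK's stopwords corpus, plus a proper lemmatizer), but it shows how much surface content such a step strips away before embedding:

    ```python
    import re

    # Tiny illustrative stop word list -- an assumption for this sketch,
    # not the set used by any particular summarisation paper.
    STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and"}

    def normalize(sentence: str) -> str:
        """Lowercase, strip punctuation, and drop stop words."""
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return " ".join(t for t in tokens if t not in STOP_WORDS)

    print(normalize("The cat is sitting in the garden."))
    # → "cat sitting garden"
    ```

    Comparing downstream scores with and without this step (as suggested above) is the most reliable way to decide whether to keep it.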

    Lowercasing depends on the pre-trained model you use (for example, BERT ships both bert-base-uncased and bert-base-cased) and on the application. One simple way to verify is to feed a text into the USE model and obtain its sentence embedding, then lowercase the same text and obtain its embedding. If the two embeddings are identical, the model is case-insensitive; if they differ, it is case-sensitive. (Running such a check, it appears that USE is case-sensitive.) The choice of lowercasing is again application-dependent.
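    The check described above can be sketched as follows. The comparison helper is pure NumPy; the USE-specific part is kept behind a `__main__` guard because loading the model requires `tensorflow_hub` and a network download, and the TF Hub URL shown is the standard published location of USE v4 (an assumption about which version you use):

    ```python
    import numpy as np

    def embeddings_match(vec_a, vec_b, tol: float = 1e-6) -> bool:
        """True if two embedding vectors are numerically identical."""
        a, b = np.asarray(vec_a, dtype=float), np.asarray(vec_b, dtype=float)
        return a.shape == b.shape and bool(np.allclose(a, b, atol=tol))

    if __name__ == "__main__":
        import tensorflow_hub as hub

        # Assumed model URL: the published USE v4 module on TF Hub.
        use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
        original = use(["Normalization Matters"]).numpy()[0]
        lowered = use(["normalization matters"]).numpy()[0]
        # False here means the two embeddings differ, i.e. USE is
        # case-sensitive, which matches the observation above.
        print("case-insensitive:", embeddings_match(original, lowered))
    ```

    The same helper works for checking the effect of any other normalization step (stop word removal, lemmatization) on the resulting embeddings.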