Tags: python, nlp, multilingual, similarity, word-embedding

What is the best approach to measure similarity between texts in multiple languages in Python?


So, I have a task where I need to measure the similarity between two texts. These texts are short descriptions of products from a grocery store. They always include the name of a product (for example, milk), and they may include a producer and/or size, and maybe some other characteristics of the product.

I have a whole set of such texts, and then, when a new one arrives, I need to determine whether there are similar products in my database and measure how similar they are (on a scale from 0 to 100%).

The thing is: the texts may be in two different languages, Ukrainian and Russian. Also, if there is a foreign brand (like Coca-Cola), it will be written in English.

My initial idea for solving this task was to get multilingual word embeddings (where similar words in different languages are located nearby) and compute the distance between those texts. However, I am not sure how effective this will be, and, if it is acceptable, what to start with.
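Something like the following sketch is what I have in mind (the library and the model name are just examples I came across; I am not sure they are the right choice):

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual model that puts similar sentences in different
# languages close together in the vector space (example choice).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = [
    "Молоко Простоквашино 2.5% 900 мл",  # Russian description
    "Молоко Простоквашино 2,5% 0.9 л",   # Ukrainian/Russian variant
    "Coca Cola 0.5 l",                   # English brand name
]

embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity is in [-1, 1]; rescale it to a 0-100% score.
sim = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Similarity: {(sim + 1) / 2 * 100:.1f}%")
```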

Because each text I have is just a set of product characteristics, word embeddings based on context may not work well (I'm not sure about this; it is just my assumption).

So far, I have tried to get familiar with the MUSE framework, but I ran into an issue with the faiss installation.

Hence, my questions are:

  • Is my idea with word embeddings worth trying?
  • Is there maybe a better approach?
  • If the idea with word embeddings is okay, which ones should I use?

Note: I am on Windows 10 (in case some libraries don't work on Windows), and I need the library to work with Ukrainian and Russian.

Thanks in advance for any help! Any advice would be highly appreciated!


Solution

  • Let's say that your task is fine-grained entity recognition. You have well-defined entities: brand, size, etc. Each of the features that define a product could be a vector, which means a product could be represented as a matrix. You could represent each feature with an embedding, or with a mixture of embeddings and one-hot vectors.

    Here is how.

    1. Define a list of product features: product name, brand name, size, weight.
    2. For each product feature, build a text recognition model: e.g., a brand recognizer finds which part of the text is the brand name.
    3. Use machine translation, where possible, to create a unified language representation for all sub-texts, e.g. mapping both ru Кока-Кола and en Coca Cola to one canonical form.
    4. Use contextual embeddings (e.g., Hugging Face multilingual BERT, or something better) to convert each feature's text into a single vector.
    5. To compare two products, compare their feature vectors: take the average similarity across the feature arrays. You can also decide how much weight each feature gets (see the sketch after this list).
    6. Try other vectorization methods. Perhaps you don't want to mix up brand knock-offs: "Coca Cola" embeds close to "Cool Cola". So embeddings may be a poor fit for brand names, size, and weight, though good enough for product names. If you need an exact match, hash the normalized (translated, prompt-engineered) multilingual text instead.
    7. You can also extend each feature vector, e.g. by concatenating several embeddings or a one-hot vector of the source language.
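Here is a hedged sketch of steps 2 and 4-6 (every model name, weight, and helper is an illustrative assumption, not a fixed recipe; the brand lexicon and regexes stand in for a real recognition model):

```python
import hashlib
import re

import numpy as np
from sentence_transformers import SentenceTransformer

# Example multilingual model (an assumption, not a recommendation).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

KNOWN_BRANDS = ["Coca Cola", "Простоквашино", "Sandora"]  # toy brand lexicon

def extract_features(text: str) -> dict:
    """Step 2 stand-in: a real system would use trained recognition
    models; a brand lexicon and a size regex are enough to illustrate."""
    brand = next((b for b in KNOWN_BRANDS if b.lower() in text.lower()), "")
    size_match = re.search(r"\d+[.,]?\d*\s*(мл|л|г|кг|ml|l|g|kg)", text, re.I)
    size = size_match.group(0) if size_match else ""
    name = text
    for part in (brand, size):
        if part:
            name = re.sub(re.escape(part), "", name, flags=re.I)
    return {"name": name.strip(), "brand": brand, "size": size}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def exact_match(a: str, b: str) -> float:
    """Step 6: hash-based exact match on normalized text."""
    h = lambda t: hashlib.md5(t.strip().lower().encode("utf-8")).hexdigest()
    return 1.0 if h(a) == h(b) else 0.0

def product_similarity(text_a: str, text_b: str) -> float:
    """Steps 4-5: embed the product name, exact-match brand and size,
    then take a weighted average. The weights are arbitrary examples."""
    fa, fb = extract_features(text_a), extract_features(text_b)
    weights = {"name": 0.6, "brand": 0.3, "size": 0.1}
    scores = {
        "name": cosine(model.encode(fa["name"]), model.encode(fb["name"])),
        "brand": exact_match(fa["brand"], fb["brand"]),
        "size": exact_match(fa["size"], fb["size"]),
    }
    return 100 * sum(weights[f] * scores[f] for f in weights)

score = product_similarity("Молоко Простоквашино 900 мл",
                           "Молоко Простоквашино 0.9 л")
print(f"Similarity: {score:.1f}%")
```

Whether the exact match uses MD5 or plain string equality makes no difference here; hashing only starts to matter once you index millions of products.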

    There is no definitive answer here; you need to experiment and test to find the best solution. You can create a test set and benchmark your candidate solutions, for example as in the sketch below.
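A tiny benchmark over hand-labeled pairs might look like this (reusing the hypothetical product_similarity from the sketch above; the pairs and target scores are made up):

```python
# Hand-labeled pairs: (text_a, text_b, expected similarity in %)
test_pairs = [
    ("Молоко Простоквашино 900 мл", "Молоко Простоквашино 0.9 л", 90.0),
    ("Молоко Простоквашино 900 мл", "Сік Sandora 1 л", 5.0),
]

errors = [abs(product_similarity(a, b) - target)
          for a, b, target in test_pairs]
print(f"Mean absolute error: {np.mean(errors):.1f} percentage points")
```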