We have a pool of documents (Word and plain text) that could include 1,000, 2,000, or even more items, and each document may contain thousands of words. Given one reference document, we need to find the documents in the pool that are semantically closest to it.
We first tried SQL Server 2017's semantic search feature, but it never returns more than 10 records, which is a limitation for us. What other technologies or tools are out there on the market to serve this purpose? We would prefer to leverage Microsoft's cognitive tools and services, but we are open to any other options, including open source, that could help.
I would recommend looking into TF-IDF approaches if the documents are of a technical nature. TF-IDF weighs the frequency of a term in a document (TF) by the inverse document frequency (IDF), a measure of how rare that term is across the overall corpus. The thinking is: a word that a document uses often, but that is rarely used in the overall corpus, is likely to be an important term for that document's meaning. A similarity measure (such as cosine similarity) is then applied to the TF-IDF vectors to find documents with a similar profile, i.e. a similar over-usage of the same relatively rare terms.
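As a rough sketch of that pipeline with scikit-learn (assuming your pool and reference texts are already loaded as plain strings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pool = ["first document text ...", "second document text ...", "..."]  # your 1,000-2,000+ documents
reference = "text of the reference document ..."

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(pool)     # TF-IDF vectors for the pool
ref_vector = vectorizer.transform([reference])  # reference projected into the same vocabulary

# Cosine similarity between the reference and every pooled document
scores = cosine_similarity(ref_vector, doc_matrix).ravel()

# Rank the pool and take as many matches as you need (no 10-row cap here)
for idx in scores.argsort()[::-1][:50]:
    print(idx, scores[idx])
```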
If the texts are less technical in nature, you could take a look at word embedding approaches such as Doc2Vec. These rely on trained models that represent words (or whole documents) as multi-dimensional vectors which try to capture meaning, so you are not dependent on the same keywords being used in both texts (which is the case with TF-IDF).
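A minimal sketch with gensim's Doc2Vec (gensim 4.x assumed; the variable names mirror the TF-IDF example above):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

pool = ["first document text ...", "second document text ...", "..."]
reference = "text of the reference document ..."

# Train one vector per document in the pool, tagged by its index
tagged = [TaggedDocument(simple_preprocess(text), [i]) for i, text in enumerate(pool)]
model = Doc2Vec(tagged, vector_size=100, min_count=2, epochs=40)

# Infer a vector for the unseen reference document and rank the pool by similarity
ref_vec = model.infer_vector(simple_preprocess(reference))
for doc_id, score in model.dv.most_similar([ref_vec], topn=20):
    print(doc_id, score)
```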
Ready-made implementations are available (especially Python-based), and Azure can facilitate these technologies as well (cf. the HDInsight-based NLP guidance at https://learn.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/natural-language-processing). You can also look at Elasticsearch, which does some of this out of the box with its more_like_this query.
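For example, with the official Python client (8.x assumed) and a hypothetical index "docs" whose "content" field holds the document text:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="docs",
    size=50,  # not limited to 10 results
    query={
        "more_like_this": {
            "fields": ["content"],
            "like": "full text of the reference document ...",
            "min_term_freq": 2,
            "max_query_terms": 50,
        }
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```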