
Azure Cognitive Search find similar documents use case


Please help me understand if what I am trying to do is possible to implement with Azure Cognitive Search.

I have a bunch of PDF files extracted and indexed as text (I don't use the index's built-in OCR feature; I extract the PDF data with third-party tools), and I need to implement a feature along the lines of "find similar documents in the index based on a new document".
As the input for the search, I pass the extracted text of the new PDF (which is usually messy and full of newline characters) and want to get back the indexed PDFs that resemble it, meaning they have a similar structure, company names, people, etc.
Is that possible? I can't find any similar cases described in the documentation, but I assume it could somehow be configured with a full-text search query.

Please advise: am I moving in the right direction at all?


Solution

  • I think there are two possible ways:

    1. Implement an enrichment process during ingestion that pre-classifies the content.

    2. Use the semantic search feature and rely on it to return documents that are similar and relevant to the content you're searching for.
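
For option 2, here is a rough sketch of how the messy extracted text could be tidied up and turned into a request body for the Search Documents REST API. The `search`, `queryType`, `semanticConfiguration` and `top` fields are request parameters of that API; the helper names, the `max_chars` cut-off and the configuration name are hypothetical and only there for illustration. It assumes a semantic configuration already exists on the index.

```python
import re


def normalize_extracted_text(raw_text, max_chars=1000):
    """Collapse newlines and repeated whitespace in extracted PDF text.

    max_chars is a hypothetical cut-off to keep the query short; it is
    not a documented service limit.
    """
    collapsed = re.sub(r"\s+", " ", raw_text).strip()
    return collapsed[:max_chars]


def build_semantic_query_body(raw_text, semantic_configuration, top=10):
    """Build a JSON-serialisable body for POST /indexes/<index>/docs/search.

    semantic_configuration must name a configuration already defined on
    the index (assumption: you have created one).
    """
    return {
        "search": normalize_extracted_text(raw_text),
        "queryType": "semantic",
        "semanticConfiguration": semantic_configuration,
        "top": top,
    }


if __name__ == "__main__":
    sample = "ACME Corp\n\nInvoice   42\r\nJohn Doe"
    print(build_semantic_query_body(sample, "my-semantic-config"))
```

The body would then be sent with any HTTP client together with your `api-key` header. Whether the results really count as "similar documents" depends on your data, so treat this as a starting point for experiments.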

    EDIT

    I just noticed there's a feature called 'moreLikeThis', which is currently in preview, and I believe it's what you're looking for:

    https://learn.microsoft.com/en-us/azure/search/search-more-like-this
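
As far as I understand the linked page, `moreLikeThis` takes the key of a document that is already in the index rather than raw text, so the new PDF would have to be indexed first and then used as the reference. A minimal sketch of building such a request URL follows; the function name, service name, index name, document key and field names are all hypothetical, and the `api-version` value in the usage example is an assumption that should be checked against the page above.

```python
from urllib.parse import quote, urlencode


def build_more_like_this_url(service_name, index_name, document_key,
                             api_version, search_fields=None):
    """Build a GET URL that asks for documents similar to document_key.

    moreLikeThis and searchFields are query parameters of the Search
    Documents API (preview); everything passed in here is a placeholder.
    """
    base = (
        f"https://{service_name}.search.windows.net"
        f"/indexes/{quote(index_name)}/docs"
    )
    params = {"moreLikeThis": document_key, "api-version": api_version}
    if search_fields:
        # Restrict the similarity comparison to specific fields.
        params["searchFields"] = ",".join(search_fields)
    return f"{base}?{urlencode(params)}"


if __name__ == "__main__":
    # The api-version below is an assumption; use a preview version
    # that the documentation lists as supporting moreLikeThis.
    print(build_more_like_this_url(
        "my-service", "pdf-index", "doc-42",
        api_version="2020-06-30-Preview",
        search_fields=["content"],
    ))
```

The request still needs your `api-key` header when you send it with an HTTP client.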

    More info:

    https://learn.microsoft.com/en-us/azure/search/semantic-search-overview

    https://youtu.be/d_6ZNyV1MvA?t=619