nlp azure-cognitive-services azure-cognitive-search azure-search-.net-sdk

How to page-wise index a blob document in Azure Cognitive Search?

I am new to Azure Search. I am indexing a few PDF documents using this method, but I want to get search results page-wise. Currently the results come from the whole document; instead, I want results returned per page, and I also need the file name and page number of the page with the highest score.

Solution

As you have noticed, the document cracking by default shoves all text into one field (content). If you have an OCR skill involved (assuming you have images within the PDF that contain text), it does the same thing by default in merged_content. I do not believe there is a way to force these two tasks to break your data out into pages.

I say "believe" because it is difficult to find documentation on the shape of the document object that is passed into your skillset. For example, look at the inputs to this merge skill. It takes /document/content and other document-related data and pushes it all into a field called merged_content. If you could find documentation on all the fields in document, it MIGHT have your pages broken down.

{
  "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
  "name": "#BookMergeSkill",
  "description": "Some description",
  "context": "/document",
  "insertPreTag": " ",
  "insertPostTag": " ",
  "inputs": [
    {
      "name": "text",
      "source": "/document/content"
    },
    {
      "name": "itemsToInsert",
      "source": "/document/normalized_images/*/text"
    },
    {
      "name": "offsets",
      "source": "/document/normalized_images/*/contentOffset"
    }
  ],
  "outputs": [
    {
      "name": "mergedText",
      "targetName": "merged_content"
    }
  ]
},

The only way I know to approach this is to use a custom skill, which would reside in an Azure Function and be called as part of the document skillset pipeline. Inside that Azure Function, you would use a PDF reader, like iText7, to crack open the documents yourself and return per-page data, which you would then place in the index document as an array of text or of custom objects.
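To give an idea of what such a function has to do, here is a rough sketch in Python rather than .NET. It only covers the custom skill contract (a "values" array of records in, a "values" array with the same recordId values out) and leaves the PDF parsing to a function you pass in. The input names file_data and file_name, the output name pages, and the extract_pages callable are all assumptions made for illustration; in a real function the extractor would wrap a PDF library such as iText7, and the input names would be whatever you map in your skill definition.

```python
import base64
from typing import Callable, Dict, List

# Hypothetical extractor type: takes raw PDF bytes, returns one string per page.
PageExtractor = Callable[[bytes], List[str]]


def split_into_pages(body: Dict, extract_pages: PageExtractor) -> Dict:
    """Build a custom-skill style response with one entry per page per record."""
    results = []
    for record in body.get("values", []):
        out = {"recordId": record["recordId"], "data": {}, "errors": [], "warnings": []}
        try:
            data = record["data"]
            # Assumed input shape: base64-encoded file bytes under file_data.data.
            pdf_bytes = base64.b64decode(data["file_data"]["data"])
            out["data"]["pages"] = [
                {"fileName": data["file_name"], "pageNumber": number, "text": text}
                for number, text in enumerate(extract_pages(pdf_bytes), start=1)
            ]
        except Exception as exc:
            # Report the failure on this record only; other records still succeed.
            out["errors"].append({"message": f"Could not split document: {exc}"})
        results.append(out)
    return {"values": results}
```

Each element of pages carries the file name and page number next to the text, so if you index the pages as separate search documents (or as a collection of complex objects), the hit with the highest score tells you which file and page it came from.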

We were going to go down the custom cracking route with a client (not to do this, but for other reasons), but the project was canned due to the cost of holding large amounts of data within an index.