Tags: azure, azure-cognitive-search, azure-ai-search

How to split document by page in Azure AI Search?


As title.

I have several PDFs stored in Azure Blob Storage. They are indexed into Azure AI Search with a skillset that uses SplitSkill.

However, even with textSplitMode set to pages, the documents are not split by page.

The skillset JSON code is as follows:

{
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#2",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/mergedText"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
}

How can I split the documents according to their actual page numbers?

I want the search output to show the answer along with the corresponding page number.


Solution

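  • The text split skill breaks documents into chunks, which are then used for further processing by other cognitive skills. In "pages" mode, a "page" is a chunk of at most maximumPageLength characters, not a physical page of the PDF.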

    I added an output field mapping that sends the output of the text split skill to the index. All of the pages are indexed under the same source document, but when another cognitive skill takes them as input with a path like /document/mypages/*, it processes each page separately.

    Below is the sample skillset I used to run the language detection skill on each page.

    {
      "@odata.context": "https://jgsai.search.windows.net/$metadata#skillsets/$entity",
      "@odata.etag": "\"0x8DC37696B99977C\"",
      "name": "skillset1709010100983",
      "description": "",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "name": "#1",
          "description": null,
          "context": "/document",
          "defaultLanguageCode": "en",
          "textSplitMode": "pages",
          "maximumPageLength": 1000,
          "pageOverlapLength": 0,
          "maximumPagesToTake": 0,
          "inputs": [
            {
              "name": "text",
              "source": "/document/content"
            },
            {
              "name": "languageCode",
              "source": "/document/language"
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "mypages"
            }
          ]
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
          "name": "#2",
          "description": "",
          "context": "/document/mypages/*",
          "defaultCountryHint": "in",
          "modelVersion": "latest",
          "inputs": [
            {
              "name": "text",
              "source": "/document/mypages/*"
            }
          ],
          "outputs": [
            {
              "name": "languageCode",
              "targetName": "languageCode"
            },
            {
              "name": "languageName",
              "targetName": "languageName"
            },
            {
              "name": "score",
              "targetName": "score"
            }
          ]
        }
      ],
      "cognitiveServices": {
        "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices",
        "description": null
      },
      "knowledgeStore": null,
      "indexProjections": null,
      "encryptionKey": null
    }
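
To make clear what "pages" means in the skillset above, here is a minimal Python sketch of length-based chunking. It is not the service's implementation (the real skill also tries to avoid cutting sentences), and `split_into_chunks` is a hypothetical helper invented for illustration. It only shows that chunk boundaries follow `maximumPageLength` and `pageOverlapLength`, not the page breaks of the PDF.

```python
def split_into_chunks(text: str, max_len: int, overlap: int = 0) -> list[str]:
    """Naive length-based splitter (hypothetical; not the service's algorithm)."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in [0, max_len)")
    step = max_len - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


# Made-up sample: a form feed stands in for the break between two PDF pages.
sample = "A" * 1500 + "\f" + "B" * 1200
chunks = split_into_chunks(sample, max_len=1000, overlap=0)

# Two physical pages, but the chunk count depends only on the length settings.
print(len(chunks), [len(c) for c in chunks])
```

With these made-up inputs, the second chunk mixes text from both physical pages, which is why a chunk position cannot be reported as a page number.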
    

    However, what you are asking for, getting each page as its own entry in the index, cannot be done with this skillset. You can either use the output of the text split skill as it is, or project the pages to a knowledge store and build a new, separate index from them.
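
As a rough sketch of the "separate index" option, the Python below turns the list of chunks produced for one source document into one record per chunk, each carrying its position. The function name and the field names (`id`, `parent_id`, `chunk_number`, `content`) are assumptions made for illustration, not a required schema, and the upload to the index is left out. Note that `chunk_number` is the position in the split output, not the page number in the PDF.

```python
def chunks_to_documents(parent_id: str, chunks: list[str]) -> list[dict]:
    """Build one record per chunk (hypothetical field names, for illustration only)."""
    return [
        {
            "id": f"{parent_id}-{number}",   # unique key per chunk
            "parent_id": parent_id,          # links the chunk back to its source file
            "chunk_number": number,          # position in the split output, starting at 1
            "content": chunk,
        }
        for number, chunk in enumerate(chunks, start=1)
    ]


# Made-up chunks standing in for the "mypages" output of one document.
docs = chunks_to_documents("report1", ["first chunk text", "second chunk text"])
for doc in docs:
    print(doc["id"], doc["chunk_number"])
```

A search over records shaped like this can return the matching chunk together with its `chunk_number` and `parent_id`.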