As title.
I have several PDFs stored in Azure Blob Storage, indexed them with Azure AI Search, and am using SplitSkill.
However, even with textSplitMode set to pages, the documents are not split by page.
The skillset JSON code is as follows:
{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "name": "#2",
  "description": "Split skill to chunk documents",
  "context": "/document",
  "defaultLanguageCode": "en",
  "textSplitMode": "pages",
  "maximumPageLength": 2000,
  "pageOverlapLength": 500,
  "maximumPagesToTake": 0,
  "inputs": [
    {
      "name": "text",
      "source": "/document/mergedText"
    }
  ],
  "outputs": [
    {
      "name": "textItems",
      "targetName": "pages"
    }
  ]
}
How can I split the documents by their actual page numbers?
I want the search results to show each answer together with its corresponding page number.
The Text Split skill breaks a document into chunks that other cognitive skills can then process. In pages mode, a "page" is a chunk of at most maximumPageLength characters, not a physical page of the PDF.
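To make the chunking behaviour concrete, here is a minimal Python sketch of length-based splitting with overlap. It is a simplification and not the service's actual algorithm: it cuts purely on character counts and ignores sentence boundaries. The function name split_into_chunks and its parameters are hypothetical; max_len and overlap only mirror the roles of maximumPageLength and pageOverlapLength.

```python
def split_into_chunks(text: str, max_len: int, overlap: int = 0) -> list[str]:
    """Cut text into chunks of at most max_len characters.

    Each chunk after the first repeats the last `overlap` characters of the
    previous chunk. Simplified illustration only: no sentence awareness.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be in the range [0, max_len)")

    step = max_len - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_len])
        if start + max_len >= len(text):
            break  # this chunk already reached the end of the text
        start += step
    return chunks


# Usage with the values from the question (2000 characters, 500 overlap).
sample = "x" * 4500
print([len(chunk) for chunk in split_into_chunks(sample, max_len=2000, overlap=500)])
```

Note that nothing in this process knows where a PDF page starts or ends, which is why the resulting chunks do not line up with page numbers.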
Below, you can see that I have added the field mappings of the output of the text split skill to the index. Even though all pages are indexed under the same source document, other cognitive skills process each page individually when given an input such as /document/mypages/*.
Below is the sample skillset I used to run the language detection skill on each page.
{
  "@odata.context": "https://jgsai.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8DC37696B99977C\"",
  "name": "skillset1709010100983",
  "description": "",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#1",
      "description": null,
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "pageOverlapLength": 0,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "languageCode",
          "source": "/document/language"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "mypages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
      "name": "#2",
      "description": "",
      "context": "/document/mypages/*",
      "defaultCountryHint": "in",
      "modelVersion": "latest",
      "inputs": [
        {
          "name": "text",
          "source": "/document/mypages/*"
        }
      ],
      "outputs": [
        {
          "name": "languageCode",
          "targetName": "languageCode"
        },
        {
          "name": "languageName",
          "targetName": "languageName"
        },
        {
          "name": "score",
          "targetName": "score"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.DefaultCognitiveServices",
    "description": null
  },
  "knowledgeStore": null,
  "indexProjections": null,
  "encryptionKey": null
}
However, what you are asking for, getting each page into the index as its own entry, cannot be done with this skillset alone. You can either use the output of the text split skill within the same source document, or use the knowledge store to write out each page and then create a new, separate index from those pages.
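As a rough sketch of the second option, the Python below turns the chunks of one source document into one record per chunk, ready to be loaded into a separate index. Everything here is an assumption for illustration: the function name build_chunk_documents and the field names id, parent_id, chunk_number and content are hypothetical, parent_id is assumed to already be a valid document key, and the upload step is not shown.

```python
def build_chunk_documents(parent_id: str, chunks: list[str]) -> list[dict]:
    """Create one index-ready record per chunk of a source document.

    chunk_number is the 1-based position of the chunk in the split output.
    It is NOT the PDF page number, because the chunks are cut by length.
    """
    documents = []
    for number, text in enumerate(chunks, start=1):
        documents.append(
            {
                "id": f"{parent_id}-{number}",  # hypothetical key format
                "parent_id": parent_id,         # links back to the source PDF
                "chunk_number": number,
                "content": text,
            }
        )
    return documents


# Usage: two chunks from a hypothetical source document "doc1".
for record in build_chunk_documents("doc1", ["first chunk", "second chunk"]):
    print(record["id"], record["chunk_number"])
```

With records shaped like this, a search hit can report which chunk of which document it came from, although mapping a chunk back to a real page number would still require page information that the split output does not carry.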