i am trying to understand a skillsets in Azure Cognitive Search. I want to build an Ocr powered search and i try to understend how it works.
For example documentation says taht ocr skill
produces response:
{
"text": "Hello World. -John",
"layoutText":
{
"language" : "en",
"text" : "Hello World. -John",
"lines" : [
{
"boundingBox":
[ {"x":10, "y":10}, {"x":50, "y":10}, {"x":50, "y":30},{"x":10, "y":30}],
"text":"Hello World."
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"-John"
}
],
"words": [
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"Hello"
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"World."
},
{
"boundingBox": [ {"x":110, "y":10}, {"x":150, "y":10}, {"x":150, "y":30},{"x":110, "y":30}],
"text":"-John"
}
]
}
}
but then in this paragraph we see, that only text
field from OCR skill is used and newcomer, contentOffset
is presented.
Custom skillset definition:
{
"description": "Extract text from images and merge with content text to produce merged_text",
"skills":
[
{
"description": "Extract text (plain and structured) from image.",
"@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
"context": "/document/normalized_images/*",
"defaultLanguageCode": "en",
"detectOrientation": true,
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "text"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Text.MergeSkill",
"description": "Create merged_text, which includes all the textual representation of each image inserted at the right location in the content field.",
"context": "/document",
"insertPreTag": " ",
"insertPostTag": " ",
"inputs": [
{
"name":"text",
"source": "/document/content"
},
{
"name": "itemsToInsert",
"source": "/document/normalized_images/*/text"
},
{
"name":"offsets",
"source": "/document/normalized_images/*/contentOffset"
}
],
"outputs": [
{
"name": "mergedText",
"targetName" : "merged_text"
}
]
}
]
}
and input should look like this:
{
"values": [
{
"recordId": "1",
"data":
{
"text": "The brown fox jumps over the dog",
"itemsToInsert": ["quick", "lazy"],
"offsets": [3, 28]
}
}
]
}
So how the array of offsets
(contentOffset
in skill definition) are coming from where OcrSkill
response not returning that and Read
method from computer vision not returning that as well from API?
contentOffset - is the default feature to extract content from the files are having images embedded init. So, whenever the OCR skillset recognizes images included in the input document, contentOffset
is called.
To answer the reason for coming array of contentOffset is due to having multiple images in every input we are uploading for analyzing. Consider the following documentation for ReadAPI through REST to follow the JSON operations.