I have a blob container where each folder represents an item that I have indexed in Azure Cognitive Search (ACS). The folder name is the key for the item in the ACS index. Imagine the following container structure:
container {
  item1 {
    blob1,
    blob2
  },
  item2 {
    blob3
  },
  item3 {
    blob4,
    blob5,
    blob6
  }
}
I want to run an indexer against the container and extract insights from the blobs with skills such as OcrSkill, KeyPhrases, EntityRecognition, etc. I know I can use ShaperSkill to get the information for a single blob/document into a format that I like. For example:
List<InputFieldMappingEntry> inputMappings = new List<InputFieldMappingEntry>();
inputMappings.Add(new InputFieldMappingEntry(
    name: "content",
    source: "/document/content"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "languageCode",
    source: "/document/languageCode"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "keyPhrases",
    source: "/document/keyPhrases"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "organizations",
    source: "/document/organizations"));
inputMappings.Add(new InputFieldMappingEntry(
    name: "name",
    source: "/document/name"));

List<OutputFieldMappingEntry> outputMappings = new List<OutputFieldMappingEntry>();
outputMappings.Add(new OutputFieldMappingEntry(
    name: "output",
    targetName: "myDoc"));

ShaperSkill shaperSkill = new ShaperSkill(
    description: "Shape to myDoc",
    context: "/document",
    name: "Doc Shaper",
    inputs: inputMappings,
    outputs: outputMappings);
And for the indexer itself, I can extract the folder name from the metadata_storage_path like this:
List<FieldMapping> fieldMappings = new List<FieldMapping>();
fieldMappings.Add(new FieldMapping(
    sourceFieldName: "metadata_storage_path",
    targetFieldName: "key",
    mappingFunction: FieldMappingFunction.ExtractTokenAtPosition("/", 4)));
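To make it clearer why position 4 yields the folder name, here is a minimal local sketch of the split-and-pick behaviour. It is not the service's implementation: `TokenSketch` and `ExtractToken` are hypothetical names, and it assumes that the position is zero-based, that empty tokens are kept, and that the path has the shape `https://<account>.blob.core.windows.net/<container>/<folder>/<blob>`.

```csharp
using System;

public static class TokenSketch
{
    // Hypothetical helper: splits the input on the delimiter and returns
    // the token at the given zero-based position. Empty tokens (such as
    // the one between the two slashes after "https:") are kept, which is
    // the assumption that makes position 4 land on the folder name.
    public static string ExtractToken(string input, string delimiter, int position)
    {
        string[] tokens = input.Split(new[] { delimiter }, StringSplitOptions.None);
        if (position < 0 || position >= tokens.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(position));
        }
        return tokens[position];
    }
}
```

Under those assumptions, a path ending in `/container/item1/blob1` splits into `https:`, an empty token, the account host, `container`, `item1` and `blob1`, so position 4 is `item1`. Every blob in the same folder therefore maps to the same key.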
What I don't know how to do (or whether it is even possible) is to take the /document/myDoc output from multiple blobs and collect those entries into a single collection field in my ACS index. My desired output would be the following:
... (only showing relevant fields here)
{
  "value": [
    {
      "key": "item1",
      "myDocs": [
        {
          "name": "blob1",
          "content": "<content from blob1>",
          "languageCode": "<languageCode from blob1>",
          "keyPhrases": "<keyPhrases from blob1>",
          "organizations": "<organizations from blob1>"
        },
        {
          "name": "blob2",
          "content": "<content from blob2>",
          "languageCode": "<languageCode from blob2>",
          "keyPhrases": "<keyPhrases from blob2>",
          "organizations": "<organizations from blob2>"
        }
      ]
    },
    {
      "key": "item2",
      "myDocs": [
        {
          "name": "blob3",
          "content": "<content from blob3>",
          "languageCode": "<languageCode from blob3>",
          "keyPhrases": "<keyPhrases from blob3>",
          "organizations": "<organizations from blob3>"
        }
      ]
    },
    {
      "key": "item3",
      "myDocs": [
        {
          "name": "blob4",
          "content": "<content from blob4>",
          "languageCode": "<languageCode from blob4>",
          "keyPhrases": "<keyPhrases from blob4>",
          "organizations": "<organizations from blob4>"
        },
        {
          "name": "blob5",
          "content": "<content from blob5>",
          "languageCode": "<languageCode from blob5>",
          "keyPhrases": "<keyPhrases from blob5>",
          "organizations": "<organizations from blob5>"
        },
        {
          "name": "blob6",
          "content": "<content from blob6>",
          "languageCode": "<languageCode from blob6>",
          "keyPhrases": "<keyPhrases from blob6>",
          "organizations": "<organizations from blob6>"
        }
      ]
    }
  ]
}
Does anyone know what I can do?
The indexer doesn't offer aggregation across multiple documents into a single index field, because its change tracking may process a blob more than once, which would make the result non-deterministic. The solution is to create two indexes: one for blobs and one for parent records. You can then either use an external process that reads from the blob index and updates the parent index in batches, which keeps the aggregation logic simpler but requires managing an external trigger, or use a Custom Web API skill that updates the parent index as blobs are processed. The aggregation logic in the custom skill may be more complex, because it has to add a child blob to the parent record only if that blob isn't already there. Check out the examples on setting up Azure Functions and connecting the skill to the function.
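The "add only if the child doesn't already exist" logic could look roughly like the sketch below. It is only an illustration under stated assumptions: `MyDoc`, `ParentRecord` and `ParentAggregator` are hypothetical types, an in-memory dictionary stands in for the parent index, and the blob name is assumed to identify a child uniquely within its folder. Reading from and writing to the real parent index is left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one child entry, mirroring the myDoc fields.
public sealed class MyDoc
{
    public string Name { get; set; } = "";
    public string Content { get; set; } = "";
    public string LanguageCode { get; set; } = "";
    public string KeyPhrases { get; set; } = "";
    public string Organizations { get; set; } = "";
}

// Hypothetical shape of one parent record (one folder).
public sealed class ParentRecord
{
    public string Key { get; set; } = "";
    public List<MyDoc> MyDocs { get; } = new List<MyDoc>();
}

public static class ParentAggregator
{
    // Adds the child to its parent, or replaces the existing child with
    // the same name. Reprocessing a blob therefore never creates a
    // duplicate entry, which keeps the operation idempotent.
    // 'parents' is an in-memory stand-in for the parent index.
    public static void Upsert(IDictionary<string, ParentRecord> parents, string key, MyDoc doc)
    {
        if (!parents.TryGetValue(key, out ParentRecord parent))
        {
            parent = new ParentRecord { Key = key };
            parents[key] = parent;
        }

        int existing = parent.MyDocs.FindIndex(d => d.Name == doc.Name);
        if (existing >= 0)
        {
            parent.MyDocs[existing] = doc;   // blob seen before: overwrite
        }
        else
        {
            parent.MyDocs.Add(doc);          // new blob: append
        }
    }
}
```

Calling `Upsert` twice for `blob1` under `item1` leaves a single `blob1` entry holding the latest values. That is the property needed when change tracking hands the same blob to the skill more than once. The same method would work in the external batch process, where it could simply be run over every record read from the blob index.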