I am trying to index PDF contents into an Azure Search index. For that I am using the function apps from the AnalyzeForm and ExtractTables projects in the Azure Search Power Skills GitHub repository.
AnalyzeForm has field mappings that let me map its output fields to the corresponding Azure Search index fields. But I also want to extract the table contents, for which I will have to use the ExtractTables API. The API returns the table records in this form:
{
"values": [
{
"recordId": "record1",
"data": {
"tables": [
{
"page_number": 1,
"row_count": 3,
"column_count": 4,
"cells": [
{
"text": "Item",
"rowIndex": 0,
"colIndex": 0,
"confidence": 1.0,
"is_header": false
},
{
"text": "Quantity",
"rowIndex": 0,
"colIndex": 1,
"confidence": 1.0,
"is_header": false
},
{
"text": "Rate",
"rowIndex": 0,
"colIndex": 2,
"confidence": 1.0,
"is_header": false
},
{
"text": "Amount",
"rowIndex": 0,
"colIndex": 3,
"confidence": 1.0,
"is_header": false
},
{
"text": "Iphone 12 (64 GB)",
"rowIndex": 1,
"colIndex": 0,
"confidence": 1.0,
"is_header": false
},
{
"text": "1",
"rowIndex": 1,
"colIndex": 1,
"confidence": 1.0,
"is_header": false
},
{
"text": "$600.00",
"rowIndex": 1,
"colIndex": 2,
"confidence": 1.0,
"is_header": false
},
{
"text": "$600.00",
"rowIndex": 1,
"colIndex": 3,
"confidence": 1.0,
"is_header": false
},
{
"text": "MI 10 (6 GB)",
"rowIndex": 2,
"colIndex": 0,
"confidence": 1.0,
"is_header": false
},
{
"text": "1",
"rowIndex": 2,
"colIndex": 1,
"confidence": 1.0,
"is_header": false
},
{
"text": "$300.00",
"rowIndex": 2,
"colIndex": 2,
"confidence": 1.0,
"is_header": false
},
{
"text": "$300.00",
"rowIndex": 2,
"colIndex": 3,
"confidence": 1.0,
"is_header": false
}
]
}
]
}
}
]
}
How can I map the data extracted from the tables so that it can be indexed into Azure Search? Is there any way I can add this as a custom skill? If yes, what should the index schema look like?
There are three ways to reshape a skill output: a Shaper skill, inline shaping, or a custom skill. In this instance I agree with @Thiago that a custom skill is the best option, because a shaper will not let you pick just a specific column value out of the table result.
If you are using the Power Skills, you could edit the skill to return a response that maps to your index field definition. Assuming your index field looks like:
{
"name": "Items",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "Item_Id",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": false,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Quantity",
"type": "Edm.Int64",
"facetable": false,
"filterable": false,
"retrievable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Rate",
"type": "Edm.Double",
"facetable": false,
"filterable": false,
"retrievable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "Amount",
"type": "Edm.Double",
"facetable": false,
"filterable": false,
"retrievable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
You could edit your skill to return a JSON object like:
{
"items" : [
{
"item": "Iphone 12",
"quantity": 1,
"price": 1000,
"amount": 1000
}
]
}
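To illustrate the reshaping step, here is a minimal sketch in Python (the Power Skills themselves are C#, so treat this as an outline of the logic, not the repository's code). It assumes row 0 of each table holds the column names, that the columns are called Item, Quantity, Rate and Amount as in the question, and that the output keys should match the sub-field names of the index definition above. The function names `cells_to_rows`, `parse_money`, `to_index_items` and `reshape_response` are made up for this example.

```python
import re
from collections import defaultdict


def cells_to_rows(table):
    """Group the flat cell list into row dicts keyed by the header text.

    Assumption: row 0 contains the column names.
    """
    grid = defaultdict(dict)
    for cell in table["cells"]:
        grid[cell["rowIndex"]][cell["colIndex"]] = cell["text"]
    if 0 not in grid:
        return []
    header = grid[0]
    rows = []
    for row_index in sorted(grid):
        if row_index == 0:
            continue  # skip the header row itself
        row = grid[row_index]
        rows.append({header[col]: row.get(col) for col in sorted(header)})
    return rows


def parse_money(text):
    """Drop currency symbols and thousands separators, e.g. '$600.00' -> 600.0."""
    return float(re.sub(r"[^0-9.\-]", "", text))


def to_index_items(table):
    """Map each table row onto the sub-fields of the Items complex collection."""
    return [
        {
            "Item_Id": row["Item"],
            "Quantity": int(row["Quantity"]),
            "Rate": parse_money(row["Rate"]),
            "Amount": parse_money(row["Amount"]),
        }
        for row in cells_to_rows(table)
    ]


def reshape_response(extract_tables_response):
    """Rewrite an ExtractTables-style response so each record carries 'items'."""
    values = []
    for record in extract_tables_response["values"]:
        items = []
        for table in record["data"]["tables"]:
            items.extend(to_index_items(table))
        values.append({"recordId": record["recordId"], "data": {"items": items}})
    return {"values": values}
```

For the sample table in the question this would yield two entries under "items", one per product row, with Quantity as an integer and Rate/Amount as numbers, so the values fit the Edm.Int64 and Edm.Double sub-fields. A real skill would also need to handle tables whose headers differ or whose cells fail to parse.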