azure, indexing, azure-cognitive-search, azure-cognitive-services, azure-form-recognizer

Extract tables from PDF to Azure search


I am trying to index PDF contents into an Azure Search index. For that, I am using the function apps from the AnalyzeForm and ExtractTables projects in the Azure Search Power Skills GitHub repository.

AnalyzeForm has field mappings that let me map its output fields to the corresponding Azure Search index fields. But I also want to extract the table contents, for which I will have to use the ExtractTables API. The API returns the table records in this form:

{
    "values": [
        {
            "recordId": "record1",
            "data": {
                "tables": [
                    {
                        "page_number": 1,
                        "row_count": 3,
                        "column_count": 4,
                        "cells": [
                            {
                                "text": "Item",
                                "rowIndex": 0,
                                "colIndex": 0,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "Quantity",
                                "rowIndex": 0,
                                "colIndex": 1,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "Rate",
                                "rowIndex": 0,
                                "colIndex": 2,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "Amount",
                                "rowIndex": 0,
                                "colIndex": 3,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "Iphone 12 (64 GB)",
                                "rowIndex": 1,
                                "colIndex": 0,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "1",
                                "rowIndex": 1,
                                "colIndex": 1,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "$600.00",
                                "rowIndex": 1,
                                "colIndex": 2,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "$600.00",
                                "rowIndex": 1,
                                "colIndex": 3,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "MI 10 (6 GB)",
                                "rowIndex": 2,
                                "colIndex": 0,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "1",
                                "rowIndex": 2,
                                "colIndex": 1,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "$300.00",
                                "rowIndex": 2,
                                "colIndex": 2,
                                "confidence": 1.0,
                                "is_header": false
                            },
                            {
                                "text": "$300.00",
                                "rowIndex": 2,
                                "colIndex": 3,
                                "confidence": 1.0,
                                "is_header": false
                            }
                        ]
                    }
                ]
            }
        }
    ]
}

How can I map the data extracted from the tables so that it gets indexed into Azure Search? Can I add this as a custom skill? If yes, what should the index schema be?


Solution

  • There are three ways to reshape a skill output: a Shaper skill, inline shaping, or a custom skill. In this instance I agree with @Thiago that the best option is a custom skill, because a shaper will not let you pick just a specific column value from the table result.

    If you are using the Power Skills, you could edit the skill to return a response that maps to your index field definition. Assuming your index field looks like:

        {
          "name": "Items",
          "type": "Collection(Edm.ComplexType)",
          "analyzer": null,
          "synonymMaps": [],
          "fields": [
            {
              "name": "Item_Id",
              "type": "Edm.String",
              "facetable": false,
              "filterable": false,
              "key": false,
              "retrievable": false,
              "searchable": false,
              "sortable": false,
              "analyzer": null,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "synonymMaps": [],
              "fields": []
            },
            {
              "name": "Quantity",
              "type": "Edm.Int64",
              "facetable": false,
              "filterable": false,
              "retrievable": false,
              "sortable": false,
              "analyzer": null,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "synonymMaps": [],
              "fields": []
            },
            {
              "name": "Rate",
              "type": "Edm.Double",
              "facetable": false,
              "filterable": false,
              "retrievable": false,
              "sortable": false,
              "analyzer": null,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "synonymMaps": [],
              "fields": []
            },
            {
              "name": "Amount",
              "type": "Edm.Double",
              "facetable": false,
              "filterable": false,
              "retrievable": false,
              "sortable": false,
              "analyzer": null,
              "indexAnalyzer": null,
              "searchAnalyzer": null,
              "synonymMaps": [],
              "fields": []
            }
          ]
        }
    

    You could edit your skill to return a JSON object like:

    {
        "items" : [
            {
                "item": "Iphone 12",
                "quantity": 1,
                "price": 1000,
                "amount": 1000
            }
        ]
    }
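
As a rough illustration of that reshaping step, here is a minimal Python sketch (the Power Skills themselves are C#, so this is not the repository's code). It pivots the flat `cells` list from the ExtractTables output into one object per data row and wraps the result in the custom skill response envelope. The function names (`parse_number`, `table_to_items`, `build_response`) are hypothetical. It assumes that row 0 holds the column headers, since `is_header` is `false` for every cell in the sample, and that the row objects should use the sub-field names from the index definition above (`Item_Id`, `Quantity`, `Rate`, `Amount`).

```python
import re
from collections import defaultdict


def parse_number(text):
    """Strip currency symbols and separators, e.g. "$600.00" -> 600.0."""
    cleaned = re.sub(r"[^0-9.\-]", "", text or "")
    try:
        return float(cleaned)
    except ValueError:
        return None  # empty or non-numeric cell


def to_int(text):
    number = parse_number(text)
    return int(number) if number is not None else None


def table_to_items(table):
    """Pivot a flat list of cells into one dict per data row."""
    rows = defaultdict(dict)
    for cell in table["cells"]:
        rows[cell["rowIndex"]][cell["colIndex"]] = cell["text"]

    # Assumption: row 0 is the header row (is_header is not reliable in the sample).
    header = rows.pop(0, {})

    items = []
    for row_index in sorted(rows):
        # Key each cell by its column's header text, e.g. {"Item": "...", "Rate": "..."}.
        named = {header.get(col): text for col, text in rows[row_index].items()}
        # Assumption: output names mirror the sub-fields of the "Items" index field.
        items.append({
            "Item_Id": named.get("Item"),
            "Quantity": to_int(named.get("Quantity")),
            "Rate": parse_number(named.get("Rate")),
            "Amount": parse_number(named.get("Amount")),
        })
    return items


def build_response(payload):
    """Turn a payload shaped like the ExtractTables output into a skill response."""
    values = []
    for record in payload["values"]:
        items = []
        for table in record["data"].get("tables", []):
            items.extend(table_to_items(table))
        values.append({
            "recordId": record["recordId"],
            "data": {"items": items},
            "errors": [],
            "warnings": [],
        })
    return {"values": values}
```

Calling `build_response` on the ExtractTables payload from the question would yield one `items` entry per data row. The `items` output could then be mapped to the `Items` field through an output field mapping in the indexer.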