azure-cognitive-search, azure-search-.net-sdk

Azure Search - Cannot merge (with skill) data obtained from the KeyPhraseExtractionSkill


I am creating an indexer that takes a document, runs the KeyPhraseExtractionSkill on it, and writes the output back to the index.

For many documents, this works out of the box. But for records longer than 50,000 characters, it does not work. OK, no problem; this limit is clearly stated in the docs.

What the docs suggest is to use the Text Split skill. So I use the Text Split skill to split the original document into pages and pass every page to the KeyPhraseExtractionSkill. The results then need to be merged back together, because we end up with an array of arrays of strings (one array of key phrases per page). Unfortunately, it seems that the Merge skill does not accept an array of arrays, only a flat array.
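To make the shape problem concrete, here is a minimal Python sketch of the split-then-extract pipeline. The two functions are toy stand-ins I made up for illustration (fixed-size chunking and "capitalized words as key phrases"); they are not how the Azure skills work internally. The point is only that running an extractor once per page yields a list of lists.

```python
from typing import List


def split_into_pages(text: str, maximum_page_length: int) -> List[str]:
    """Toy stand-in for the Text Split skill: fixed-size character chunks.

    Assumption: the real skill splits more carefully than this; only the
    output shape (a list of page strings) matters here.
    """
    return [
        text[i:i + maximum_page_length]
        for i in range(0, len(text), maximum_page_length)
    ]


def extract_key_phrases(page: str) -> List[str]:
    """Toy stand-in for KeyPhraseExtractionSkill: unique capitalized words."""
    phrases: List[str] = []
    for word in page.split():
        cleaned = word.strip(".,")
        if cleaned[:1].isupper() and cleaned not in phrases:
            phrases.append(cleaned)
    return phrases


content = "Azure Search indexes documents. Skills enrich Documents with Phrases."
pages = split_into_pages(content, maximum_page_length=32)

# One list of key phrases per page, so the overall result is a list of lists,
# which mirrors what /document/content/pages/*/keyPhrases refers to.
key_phrases_per_page = [extract_key_phrases(page) for page in pages]
```

Each element of `key_phrases_per_page` is itself a list, which is the nested shape the Merge skill rejects in the error below.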

https://i.sstatic.net/8UmYj.png <- Link to the skillset hierarchy.

This is the error reported by Azure:

Required skill input was not of the expected type 'StringCollection'. Name: 'itemsToInsert', Source: '/document/content/pages/*/keyPhrases'. Expression language parsing issues:

What I ultimately want to achieve is to run the KeyPhraseExtractionSkill on text that is longer than 50,000 characters and add the resulting key phrases back to the index.

JSON for skillset

```json
{
  "@odata.context": "https://-----------.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8D957466A2C1E47\"",
  "name": "devalbertcollectionfilesskillset2",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "SplitSkill",
      "description": null,
      "context": "/document/content",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "name": "EntityRecognitionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "categories": [
        "person",
        "quantity",
        "organization",
        "url",
        "email",
        "location",
        "datetime"
      ],
      "defaultLanguageCode": "en",
      "minimumPrecision": null,
      "includeTypelessEntities": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "persons",
          "targetName": "people"
        },
        {
          "name": "organizations",
          "targetName": "organizations"
        },
        {
          "name": "entities",
          "targetName": "entities"
        },
        {
          "name": "locations",
          "targetName": "locations"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "name": "KeyPhraseExtractionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "defaultLanguageCode": "en",
      "maxKeyPhraseCount": null,
      "modelVersion": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "keyPhrases",
          "targetName": "keyPhrases"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "Merge Skill - keyPhrases",
      "description": null,
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "itemsToInsert",
          "source": "/document/content/pages/*/keyPhrases"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "keyPhrases"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "key": "------",
    "description": "/subscriptions/13abe1c6-d700-4f8f-916a-8d3bc17bb41e/resourceGroups/mde-dev-rg/providers/Microsoft.CognitiveServices/accounts/mde-dev-cognitive"
  },
  "knowledgeStore": null,
  "encryptionKey": null
}
```

Please let me know if there is anything else that I can add to improve the question. Thanks!



Solution

  • You don't have to merge the key phrase outputs to insert them into the index.

    Assuming your index already has a field called mykeyphrases of type Collection(Edm.String), add this indexer output field mapping to populate it with the key phrase outputs:

    "outputFieldMappings": [
      ...
    
      {
        "sourceFieldName": "/document/content/pages/*/keyPhrases/*",
        "targetFieldName": "mykeyphrases"
      },
    
      ...
    ]
    

    The /* at the end of sourceFieldName is what flattens the array of arrays of strings into a single array of strings. The same path also works as a skill input if you want to pass a flat array of strings to another skill for further enrichment.
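As a rough illustration of why the trailing /* matters, here is a small Python sketch of wildcard path resolution over a nested structure. It is a toy model, not Azure's implementation: the `resolve` and `select` helpers and the sample tree are hypothetical, and each page is modeled as a plain dict holding only its keyPhrases.

```python
from typing import Any, List


def resolve(node: Any, segments: List[str]) -> List[Any]:
    """Collect every value matched by the path; a '*' segment fans out over a list."""
    if not segments:
        return [node]
    head, rest = segments[0], segments[1:]
    if head == "*":
        matches: List[Any] = []
        for item in node:
            matches.extend(resolve(item, rest))
        return matches
    return resolve(node[head], rest)


def select(tree: dict, path: str) -> List[Any]:
    """Hypothetical helper: evaluate a '/'-separated path against the tree."""
    return resolve(tree, path.strip("/").split("/"))


# Hypothetical enrichment tree with two pages (simplified for illustration).
tree = {
    "document": {
        "content": {
            "pages": [
                {"keyPhrases": ["azure search", "indexer"]},
                {"keyPhrases": ["skillset"]},
            ]
        }
    }
}

nested = select(tree, "/document/content/pages/*/keyPhrases")
# nested == [["azure search", "indexer"], ["skillset"]]

flat = select(tree, "/document/content/pages/*/keyPhrases/*")
# flat == ["azure search", "indexer", "skillset"]

# The mapping from the answer, written as a Python dict for reference.
output_field_mapping = {
    "sourceFieldName": "/document/content/pages/*/keyPhrases/*",
    "targetFieldName": "mykeyphrases",
}
```

Without the final /*, each match is a whole per-page array, so the result stays nested; with it, each match is an individual string, which is the shape a Collection(Edm.String) field expects.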