azure-blob-storage azure-cognitive-search

Unable to decode blob metadata into Azure Search index


I'm using an indexer to index PDF files into an Azure Search Index. I have some metadata parameters encoded as URL-safe base64 (document_url in the screenshot):

[Screenshot: blob metadata]

Everything works fine. The indexer runs and the document_url is decoded and indexed in a Url property.

The problem comes when I try to do the same for a metadata_title parameter. It is configured in exactly the same way as 'document_url', but when the indexer runs it throws an error:

Message: Could not parse document. Could not apply mapping function 'base64Decode' to field 'metadata_title'. Details: The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters.

The metadata_title has the following value: Tm9uLVNtb2tlcsKQcyBEZWNsYXJhdGlvbg

Using an external tool I'm able to decode that without issues.
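
For reference, here is a minimal Python sketch of that kind of external check. It assumes the value is URL-safe base64 with its trailing '=' padding stripped; the helper name decode_urlsafe_b64 is hypothetical and not part of any Azure SDK.

```python
import base64


def decode_urlsafe_b64(value: str) -> str:
    """Decode URL-safe base64 text whose '=' padding may have been stripped."""
    # Restore the padding so the length is a multiple of 4 (assumption:
    # the encoder only removed trailing '=' characters).
    padded = value + "=" * (-len(value) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


title = decode_urlsafe_b64("Tm9uLVNtb2tlcsKQcyBEZWNsYXJhdGlvbg")
print(repr(title))
```

The value decodes cleanly to "Non-Smoker s Declaration" with a U+0090 control character in place of the apostrophe, which suggests the encoded string itself is valid base64.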

The mapping configuration is the same for both 'document_url' and 'metadata_title':

{
  "sourceFieldName": "metadata_title",
  "targetFieldName": "MetaTitle",
  "mappingFunction": {
    "name": "base64Decode",
    "parameters": {
      "useHttpServerUtilityUrlTokenDecode": false
    }
  }
}

Even if I remove the 'metadata_title' property from the blob, it keeps throwing the same error.

Maybe the problem is in the Index?

This is where the metadata_title is being mapped:

{
  "name": "MetaTitle",
  "type": "Edm.String",
  "searchable": true,
  "filterable": false,
  "retrievable": true,
  "sortable": false,
  "facetable": false,
  "key": false,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "analyzer": null,
  "normalizer": null,
  "synonymMaps": []
}

This property is searchable, while Url (the property where 'document_url' is being mapped) is not. That's the only difference I can see.


Solution

  • 'metadata_title' is one of the reserved metadata properties that the blob indexer extracts from PDFs when using "parsingMode": "default" and "dataToExtract": "contentAndMetadata". I suspect the value being decoded is the title that Azure Cognitive Search (ACS) extracted from your document, which is not base64 encoded, rather than the one from your blob's metadata. That would also explain why the error persists after you remove the property from the blob. To get around this, rename the blob metadata field to a name that is not one of the reserved properties that ACS extracts on your behalf. If that is not feasible, please open a support ticket so that one of our engineers can work with you on a more custom solution.
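
As an illustration of the rename, the sketch below builds the same field mapping from the question but points it at a differently named blob metadata key. The key name "encoded_title" and the helper base64_decode_mapping are hypothetical; any name that does not collide with a reserved property should work, and the blob's metadata would need to be updated to use it.

```python
import json


def base64_decode_mapping(source_field: str, target_field: str) -> dict:
    """Build an indexer field mapping that applies base64Decode."""
    return {
        "sourceFieldName": source_field,
        "targetFieldName": target_field,
        "mappingFunction": {
            "name": "base64Decode",
            "parameters": {"useHttpServerUtilityUrlTokenDecode": False},
        },
    }


# "encoded_title" is a hypothetical, non-reserved blob metadata key.
mapping = base64_decode_mapping("encoded_title", "MetaTitle")
print(json.dumps(mapping, indent=2))
```

The printed JSON can be pasted into the indexer's fieldMappings array in place of the original 'metadata_title' entry.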