azure, azure-cognitive-search

Azure Blob Indexer metadata fields, encoding


I am using the preview of the Azure Search Blob Indexer. All of the information that should be indexed is contained in the blob metadata. While testing things out, I ran into a problem with metadata encoding:

Because Azure Storage blob metadata values have to be valid HTTP header values, we have to encode non-ASCII characters (see "Invalid character exception when adding Metadata to a CloudBlob"). The standard encoding for HTTP header values, if I researched correctly, is MIME header encoding (RFC 2047, https://www.ietf.org/rfc/rfc2047.txt).
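To make the encoding step concrete, here is a minimal sketch in Python using the standard library's `email.header` module. The helper name `encode_metadata_value` is made up for illustration, and the choice of UTF-8 as the charset is an assumption; nothing here is specific to the Azure SDK.

```python
from email.header import Header


def encode_metadata_value(value: str) -> str:
    """Return an ASCII-only form of `value` (hypothetical helper)."""
    if value.isascii():
        # Plain ASCII is already a valid header value; leave it alone.
        return value
    # Encode as an RFC 2047 encoded-word. maxlinelen=0 disables line
    # folding, since a folded value would contain line breaks.
    return Header(value, charset="utf-8").encode(maxlinelen=0)


if __name__ == "__main__":
    # Prints an encoded-word of the form =?utf-8?...?=
    print(encode_metadata_value("Müller"))
```

The result is what ends up stored as the blob metadata value, and therefore what the indexer later reads.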

When doing this, the index will contain the encoded values, which is not great for searching. I have not found a way to make the blob indexer decode those values for the index fields, as the metadata fields are added verbatim (source: https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#ContentSpecificMetadata).
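To illustrate why the verbatim values are a problem, the sketch below (Python, standard library only) takes an RFC 2047 encoded-word as it would sit in the metadata and decodes it. The function name `decode_rfc2047` and the sample value are assumptions for illustration; this is the decoding the indexer would need to do, not something it offers.

```python
from email.header import decode_header, make_header


def decode_rfc2047(value: str) -> str:
    """Decode RFC 2047 encoded-words; plain ASCII passes through unchanged."""
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    stored = "=?utf-8?q?M=C3=BCller?="  # sample value as stored in metadata
    # The original text does not appear in the stored value,
    # so a search for it cannot match the indexed field.
    print("Müller" in stored)       # False
    print(decode_rfc2047(stored))   # Müller
```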

I know that Azure Blob Indexer is in Preview, but I am trying to document a few issues I am running into while trying to use Azure Search Blob Indexer!


Solution

  • This is on our radar. Please vote for this UserVoice suggestion to help us prioritize this work. We'll probably do this as a Base64 decoding capability, because RFC 2047 encoding is relatively obscure. Thanks!
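If decoding does arrive as Base64, the writing side would store Base64 text instead of RFC 2047 encoded-words. A minimal Python sketch of that round trip follows. The helper names are hypothetical, and the use of UTF-8 with the standard Base64 alphabet and padding is an assumption; the answer does not say which variant the indexer would expect.

```python
import base64


def to_base64_metadata(value: str) -> str:
    """Encode text as ASCII-only Base64 (assumed: UTF-8, standard alphabet)."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def from_base64_metadata(encoded: str) -> str:
    """Reverse of to_base64_metadata; the step the indexer would perform."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    encoded = to_base64_metadata("Müller")
    print(encoded)                        # TcO8bGxlcg==
    print(from_base64_metadata(encoded))  # Müller
```

Unlike the RFC 2047 approach, every value would be encoded this way, including plain ASCII ones, so the reader never has to guess whether a given value is encoded.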