I am using the preview of the Azure Search Blob Indexer. All of the information that should be indexed is contained in the blob metadata. While testing, I ran into a problem with metadata encoding:
Because Azure Storage blob metadata values have to be valid HTTP header values, non-ASCII characters must be encoded (see "Invalid character exception when adding Metadata to a CloudBlob"). As far as I could find out, the standard encoding for HTTP header values is MIME header encoding, as specified in RFC 2047 (https://www.ietf.org/rfc/rfc2047.txt).
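To make the encoding step concrete, here is a minimal sketch in Python using only the standard library's `email.header` module. The helper names `encode_metadata_value` and `decode_metadata_value` are hypothetical and not part of any Azure SDK; the sketch only shows what an RFC 2047 encoded value looks like before it is stored as blob metadata, and how a client could decode it again.

```python
from email.header import Header, decode_header, make_header


def encode_metadata_value(value: str) -> str:
    """Return an ASCII-only form of value (hypothetical helper).

    Pure ASCII input is returned unchanged; anything else is wrapped in
    an RFC 2047 encoded-word such as =?utf-8?...?=.
    """
    if value.isascii():
        return value
    # maxlinelen=0 disables line folding, so the result contains no
    # line breaks, which would not be valid inside a header value.
    return Header(value, charset="utf-8").encode(maxlinelen=0)


def decode_metadata_value(value: str) -> str:
    """Reverse encode_metadata_value (hypothetical helper)."""
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    original = "Müller Straße"
    encoded = encode_metadata_value(original)
    print(encoded)                         # ASCII-only encoded-word
    print(decode_metadata_value(encoded))  # original text again
```

Note that RFC 2047 limits a single encoded-word to 75 characters; disabling folding as above ignores that limit for long values, so treat this as an illustration rather than a complete solution.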
When I do this, the index ends up containing the encoded values, which is not great for searching. I have not found a way to make the blob indexer decode those values for the index fields, because the metadata fields are added verbatim (source: https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#ContentSpecificMetadata).
I know that the Azure Search Blob Indexer is still in preview, but I want to document the issues I run into while trying to use it.
This is on our radar. Please vote for this UserVoice suggestion to help us prioritize this work. We'll probably do this as a Base64 decoding capability because RFC 2047 encoding is relatively obscure. Thanks!
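For illustration only, this is roughly what Base64-encoding a metadata value on the client side could look like in Python. The function names are hypothetical, and which Base64 variant (standard or URL-safe) a future decoding feature would expect is an assumption here, not something confirmed above.

```python
import base64


def encode_metadata_value_b64(value: str) -> str:
    """Encode value as standard Base64 over its UTF-8 bytes.

    Hypothetical helper; the standard alphabet is an assumption.
    """
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def decode_metadata_value_b64(encoded: str) -> str:
    """Reverse encode_metadata_value_b64 (hypothetical helper)."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    encoded = encode_metadata_value_b64("Müller Straße")
    print(encoded)                             # ASCII-only, header-safe
    print(decode_metadata_value_b64(encoded))  # original text again
```

Unlike an RFC 2047 encoded-word, a bare Base64 string carries no charset marker, so both sides have to agree on UTF-8 in advance.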