I have an Azure Storage container which contains a mix of files (pdf, doc, docx, jpg, png, ...) stored as blobs.
I'm trying to use the Azure Search blob indexer to index the metadata for all files (including images) and, where possible, extract the content for full-text searching (obviously images don't have any extractable text content). I want an entry in the search index for each image because I have additional data in DocumentDB that I want to manually merge into the search index using a WebJob.
Using the Azure Portal I have added the data source, index and indexer. However, when the indexer runs, it fails with the following error:
Document 'https://xxx.blob.core.windows.net/xxx/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx-v1' has unsupported content type 'image/jpeg'
The documentation at https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#using-custom-metadata-to-control-document-extraction says that if I add metadata to the blob with a key of "AzureSearch_SkipContent" and a value of "true", the indexer should not attempt to extract content.
After adding the "AzureSearch_SkipContent" metadata to all content types not listed in the table on https://azure.microsoft.com/en-us/documentation/articles/search-howto-indexing-azure-blob-storage/#content-type-specific-metadata-properties , the indexer is still failing with the error above.
If I add "AzureSearch_Skip" metadata set to "true" then the indexer does skip the image blob, but then I don't have anything in the index for it - which is not what I want.
Here is an example of the steps I'm trying to achieve:
So, should it be possible to add "AzureSearch_SkipContent" to an image blob and have something appear in the search index for it? Or is my only solution to "AzureSearch_Skip" it completely and then manually add something into the search index for it?
AzureSearch_SkipContent flag only works for supported content types, where Azure Search can extract content-type specific metadata.
Azure Search also supports indexing only the storage metadata and skipping content type metadata and content extraction - in this case, the content type doesn't matter. However, this setting is only available at the indexer scope and applies to all blobs. See Index storage metadata only.
We've heard similar questions from several customers, so we're adding another switch that will behave as follows:
It looks like this will be helpful in your case.
UPDATE on Dec 7, 2016:
This functionality is now available. To continue indexing when an unsupported content type is encountered, set the failOnUnsupportedContentType configuration parameter to false:
PUT https://[service name].search.windows.net/indexers/[indexer name]?api-version=2016-09-01
Content-Type: application/json
api-key: [admin key]
{
... other parts of indexer definition
"parameters" : { "configuration" : { "failOnUnsupportedContentType" : false } }
}
For more info, see Controlling which blobs are indexed
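If you would rather apply this setting from a script than craft the request by hand, here is a minimal Python sketch of the same PUT request. It only uses the standard library and builds the request without sending it. The function name build_indexer_update and the sample values ("my-service", "my-indexer", and so on) are hypothetical, and the sketch assumes you pass in your full existing indexer definition, since the PUT replaces the whole definition.

```python
import json
import urllib.request


def build_indexer_update(service_name, indexer_name, admin_key,
                         indexer_definition, api_version="2016-09-01"):
    """Build (but do not send) the PUT request that updates an indexer.

    Hypothetical helper: indexer_definition is assumed to be the full,
    existing indexer definition as a dict.
    """
    # Copy the nested dicts so the caller's definition is not modified.
    body = dict(indexer_definition)
    parameters = dict(body.get("parameters", {}))
    configuration = dict(parameters.get("configuration", {}))

    # Keep indexing when a blob with an unsupported content type is found.
    configuration["failOnUnsupportedContentType"] = False
    parameters["configuration"] = configuration
    body["parameters"] = parameters

    url = "https://{0}.search.windows.net/indexers/{1}?api-version={2}".format(
        service_name, indexer_name, api_version)
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": admin_key},
        method="PUT",
    )


# Placeholder values for illustration only.
request = build_indexer_update(
    "my-service", "my-indexer", "my-admin-key",
    {"name": "my-indexer",
     "dataSourceName": "my-datasource",
     "targetIndexName": "my-index"},
)
```

Passing the resulting object to urllib.request.urlopen(request) would send it. Any configuration values already present in the definition are preserved, and only failOnUnsupportedContentType is added or overwritten.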