azure azure-cognitive-search azure-search-.net-sdk

Azure Cognitive Search - Index binary data (MS Office files) from an external data source (no Azure Blob)

I'm trying to find out whether it is possible, and if so how, to index binary data (mostly MS Office documents and PDFs) that does not reside in Azure Blob Storage but in other, non-Azure data sources.

The closest example I found copies the files to an Azure blob container and then adds a skillset to index the documents from there.

I would like to bypass the Azure blob container and push the document metadata as well as the binary content directly.

Any advice or examples I can look at?

Thanks

Solution

You can define a skillset that combines custom and built-in skills when you push data to the index. There is a built-in Document Extraction skill that does what you want: it extracts text content from a file passed to it within the enrichment pipeline. See:

https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-document-extraction
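
As a rough illustration of what the linked page describes, here is a minimal Python sketch that builds the JSON for a skillset containing the Document Extraction skill, plus the file reference object the skill accepts as its `file_data` input. It only constructs the payloads and does not call the service. The helper names, the skillset name, and the output target name `extracted_content` are hypothetical. How the file reference ends up at `/document/file_data` (for example, a custom skill that fetches the file from your external source and returns it base64-encoded) is an assumption about your setup, not something the answer above specifies.

```python
import base64
import json


def make_file_reference(raw_bytes: bytes) -> dict:
    """Wrap raw file bytes in the file reference shape the skill accepts.

    The "$type"/"data" layout follows the Document Extraction skill docs;
    this helper itself is hypothetical.
    """
    return {
        "$type": "file",
        "data": base64.b64encode(raw_bytes).decode("ascii"),
    }


def make_document_extraction_skill(source_path: str = "/document/file_data") -> dict:
    """Build the skill definition as a plain dict.

    source_path is an assumption: it must point at wherever your pipeline
    places the file reference object.
    """
    return {
        "@odata.type": "#Microsoft.Skills.Util.DocumentExtractionSkill",
        "parsingMode": "default",
        "dataToExtract": "contentAndMetadata",
        "context": "/document",
        "inputs": [{"name": "file_data", "source": source_path}],
        # "extracted_content" is an illustrative target name.
        "outputs": [{"name": "content", "targetName": "extracted_content"}],
    }


def make_skillset(name: str) -> dict:
    """Assemble a skillset body containing only the extraction skill."""
    return {
        "name": name,
        "description": "Extract text from Office/PDF files held outside Azure Blob Storage",
        "skills": [make_document_extraction_skill()],
    }


if __name__ == "__main__":
    # Hypothetical skillset name; print the JSON you would send to the service.
    print(json.dumps(make_skillset("external-docs-skillset"), indent=2))
```

The resulting JSON could be used as the body when creating the skillset through the REST API or an SDK; check the linked documentation for the exact parameters your API version supports.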