azure-cognitive-search azure-blob-storage

Using Azure Search for PDFs in Azure Blob Storage


We are trying to enable full-text search. Our application stores PDF files in Azure Blob Storage, which is the data source for Azure Search. Most of this works fine; however, the indexer cannot extract text from a couple of the PDFs. Are there specific kinds of PDFs that the Azure Search indexer can extract text from? If so, what are they?

Any information or help in this regard would be greatly appreciated.


Solution

  • Azure Search can extract all text from PDF text elements. Extracting text from embedded images (which requires OCR) or tables is not yet integrated in Azure Search, but it is on the roadmap.

    If your PDFs contain images and you want to extract text from those as well, then you can try following the steps here.
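
To see which PDFs came through without any text, one option is to query the index for documents whose extracted content is empty. The sketch below is not part of the original answer. It assumes the index exposes the blob indexer's default `content` and `metadata_storage_name` fields as retrievable, that API version `2020-06-30` is available on your service, and that the service name, index name, and API key are supplied by you. The helper function names are made up for illustration.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # assumption: use whichever version your service supports


def build_search_request(service, index, api_key, top=1000):
    """Build a POST request returning each document's blob name and extracted text."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    body = json.dumps({
        "search": "*",  # match every document
        "select": "metadata_storage_name,content",  # assumed field names
        "top": top,  # only the first page of results is fetched
    }).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def names_without_text(response):
    """Return blob names whose extracted content is missing or only whitespace."""
    return [
        doc.get("metadata_storage_name", "<unknown>")
        for doc in response.get("value", [])
        if not (doc.get("content") or "").strip()
    ]


def find_pdfs_without_text(service, index, api_key):
    """Run the query against the service and list blobs with no extracted text."""
    request = build_search_request(service, index, api_key)
    with urllib.request.urlopen(request) as reply:
        return names_without_text(json.load(reply))
```

Calling `find_pdfs_without_text("my-service", "my-index", "<query key>")` with your own values would list candidate files. PDFs that show up here are likely scanned or image-only documents, which have no text elements for the indexer to read.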