azure-cognitive-search azure-blob-storage azure-cognitive-services

Do I have to store PDF files in Azure Blob Storage to OCR and index them?


I'm testing Azure Search to index my website for searching.

I have created an index and I'm able to get the info from the website pages and push them to the index.

My question is about indexing the content of PDF files: both their text and, using Cognitive Services, the text extracted from images inside the PDFs.

Tutorials on indexing PDF files seem to assume that the files are in a location a Search indexer can reach, such as Azure Blob Storage. So it seems I would have to copy all the PDF files that are already on my website into Azure Blob Storage (somehow saving their original URLs somewhere) so that I can index them and extract the content using a data source, an indexer, and an index.

The functionality I'm looking for is this: you go to my website, search for text that could be in a PDF file's text or in an image inside it, and the search result gives you the original URL of the PDF file (not the Azure Storage URL).

Is it possible to index the content of the PDF files directly from my website (including the Cognitive Services step) with the Azure REST API? Or do I have to put these files in Azure Blob Storage first? If so, how would I preserve the URL so that when the indexer runs and extracts the content, I can add the original file URL to the index?


Solution

  • Currently, Azure Search supports the following platforms as indexer data sources:

    • Blob storage
    • Table storage
    • Azure Cosmos DB
    • Azure SQL database, and SQL Server on Azure VMs

    So if you want to index your PDFs, you should store them in Azure Storage so that Azure Search can extract their content and index them.
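As a rough sketch of that first step, the snippet below builds the JSON body of a blob data source definition, in the shape the Azure Search REST API expects. The data source name, container name, and connection string are placeholders, and the helper `blob_data_source` is a made-up name for illustration; it only builds the payload and does not call the service.

```python
import json


def blob_data_source(name: str, container: str, connection_string: str) -> dict:
    """Build the request body for creating a blob data source."""
    return {
        "name": name,
        "type": "azureblob",  # data source type for Azure Blob Storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


if __name__ == "__main__":
    body = blob_data_source(
        name="pdf-datasource",  # hypothetical data source name
        container="pdfs",       # hypothetical container holding the PDFs
        connection_string="<your-storage-connection-string>",
    )
    print(json.dumps(body, indent=2))
```

The resulting JSON would be sent to the service's data source endpoint with your admin key; the indexer then refers to the data source by its name.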

    If you want to include the original file URL in your index, you can add user-defined metadata to your PDF blob, e.g. "originalUrl", so that it will be indexed by Azure Search along with the extracted content.
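A minimal sketch of how the pieces could fit together, assuming the metadata key is "originalUrl" as above. The helper names, the index, indexer, and data source names, and the example URL are all hypothetical. The snippet only builds the metadata header for the blob upload and the JSON bodies for the index and indexer; it does not call Azure, and the OCR skillset is left out.

```python
import json

METADATA_KEY = "originalUrl"  # user-defined blob metadata name from the answer


def metadata_headers(original_url: str) -> dict:
    """Header that attaches the original URL to a blob as user metadata.

    The Blob Storage REST API carries user metadata in x-ms-meta-<name> headers.
    """
    return {f"x-ms-meta-{METADATA_KEY}": original_url}


def index_definition(index_name: str) -> dict:
    """Index with a key, the extracted text, and the original URL."""
    return {
        "name": index_name,
        "fields": [
            {"name": "id", "type": "Edm.String", "key": True},
            {"name": "content", "type": "Edm.String", "searchable": True},
            # Same name as the blob metadata key, so the blob indexer can
            # fill it without an explicit field mapping.
            {"name": METADATA_KEY, "type": "Edm.String", "retrievable": True},
        ],
    }


def indexer_definition(name: str, data_source: str, index_name: str) -> dict:
    """Indexer that keys documents on the base64-encoded blob path."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index_name,
        "fieldMappings": [
            {
                "sourceFieldName": "metadata_storage_path",
                "targetFieldName": "id",
                "mappingFunction": {"name": "base64Encode"},
            }
        ],
    }


if __name__ == "__main__":
    print(metadata_headers("https://www.example.com/docs/report.pdf"))
    print(json.dumps(index_definition("pdf-index"), indent=2))
    print(json.dumps(indexer_definition("pdf-indexer", "pdf-datasource", "pdf-index"), indent=2))
```

With this layout, a search hit returns the "originalUrl" field, so the result can link back to the PDF on your website rather than to the blob.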

    Hope it helps.