azure-cognitive-services azure-cognitive-search azure-openai

Azure Cognitive Search indexing considerations

I am trying to use the preview feature of Azure OpenAI Service for bringing your own data. I have a large blob storage containing hundreds of thousands of documents (.pdf, .docx, .xls, etc.) and I would like to be able to query them with some filtering behind the scenes. For example, "Provide me with a summary for price docs" should return a summary restricted by a custom field that I've applied through code. I am trying to follow the RAG pattern, but have run into these issues:

I considered the pull model for Cognitive Search, but there the custom field for filtering cannot be added, because we don't have it as blob metadata. A vector representation of the content cannot be added either.
I considered the push model, but only JSON can be pushed. Here we can add the custom field for filtering and also include a vector representation, but the Document Extraction cognitive skill cannot be called from code (.NET), so text extraction is not possible.

What is the best approach here, and are there any foreseeable obstacles to connecting this Cognitive Search instance to the OpenAI Service at a later point?

Solution

Pulling data with Cognitive Search and pushing data to Cognitive Search both result in the same thing: an index, in JSON format. The only difference is how you populate your index:

when 'pulling' data, you define an indexer that is in charge of accessing the data where it is located, performing a few operations (enrichment, chunking, embeddings generation, etc.) and adding the result to the index
when 'pushing' data, you call the Search API (directly or via an SDK) to push the content into the index
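
To make the push side more concrete, here is a minimal sketch of the JSON body that the Search REST API's index-documents operation (`POST /indexes/{index}/docs/index`) accepts. The field names (`id`, `content`, `category`, `contentVector`) and the sample values are assumptions for illustration only; they would have to match your own index schema, and the text extraction and embedding generation would need to happen in your own code before this step.

```python
import json


def build_upload_batch(docs):
    """Wrap plain dicts in the batch shape used by the index-documents operation."""
    return {"value": [{"@search.action": "upload", **doc} for doc in docs]}


# Hypothetical document: field names are assumptions, not taken from the question.
docs = [
    {
        "id": "price-list-2023-0",
        "content": "Text extracted from the source file by your own code.",
        "category": "price",                   # custom field used for filtering
        "contentVector": [0.12, -0.03, 0.4],   # placeholder; real embeddings are much longer
    }
]

batch = build_upload_batch(docs)
body = json.dumps(batch)  # request body to send to the index-documents endpoint
print(body)
```

With a filterable `category` field in the index, a query could then be restricted with an OData filter such as `category eq 'price'`.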

So there is no "better option" from my point of view for what you are trying to achieve.

You will also struggle to generate summaries if your 'source documents' are split (chunked) into several search items (aka documents), as you might not retrieve all the content needed to generate the summary.
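
One possible way to soften this, sketched here under assumptions, is to store a parent identifier and a chunk position on every chunk, so that all chunks of a source document can be fetched with a filter and stitched back together before summarizing. The field names `parent_id`, `chunk_index` and `content` are hypothetical, and `hits` stands in for the results of a filtered search query.

```python
from collections import defaultdict


def reassemble(hits):
    """Group chunk hits by source document and join them in their original order."""
    by_parent = defaultdict(list)
    for hit in hits:
        by_parent[hit["parent_id"]].append(hit)
    return {
        parent: "\n".join(
            h["content"] for h in sorted(chunks, key=lambda h: h["chunk_index"])
        )
        for parent, chunks in by_parent.items()
    }


# Hypothetical search results, deliberately out of order.
hits = [
    {"parent_id": "prices.pdf", "chunk_index": 1, "content": "Second part."},
    {"parent_id": "prices.pdf", "chunk_index": 0, "content": "First part."},
    {"parent_id": "terms.docx", "chunk_index": 0, "content": "Only part."},
]

full_texts = reassemble(hits)
```

This only helps if the query actually returns every chunk of a document; a top-N relevance or vector query does not guarantee that.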