
How to get proper context from AI-Search documents when calling OpenAI completion endpoint


I found this documentation about chatting with your own data using AI-Search & OpenAI.

It works fine for my data; however, I don't get any additional context aside from the content and the score:

{"content":"<MY CONTENT>", "id":null,"title":null,"filepath":null,"url":null,"metadata":{"chunking":"orignal document size=2000. Scores=3.6962261Org Highlight count=31."},"chunk_id":"0"}

I think the additional fields in AI Search need to be specified somewhere in the code, but I don't know where, and I couldn't find any example of it.

In the Azure OpenAI Chat Playground you can select the fields within your AI Search index, and they are then also correctly displayed in the sample chat app.


How can I achieve the same in my code using the code example referenced above?


Solution

  • I found the solution myself. It turns out that you do not need to use the 'default' names for your AI Search index fields: you can name your index fields whatever you want. However, you need to map your field names to the expected defaults via `fieldsMapping`. Here is a working example:

    import openai  # requires the openai v1 SDK

    # NOTE: `config` and `ai_search` are dictionaries holding the relevant
    # settings; load them however your application does (env vars, files, ...).
    def ask_llm_citation(USER_INPUT: str, AZURE_OPENAI_SYSTEM_MESSAGE: str, NR_DOCUMENTS: int, STRICTNESS: int):
        def parse_multi_columns(columns: str) -> list:
            # A fields setting may name several index fields, pipe- or comma-separated
            if "|" in columns:
                return columns.split("|")
            return columns.split(",")
    
        endpoint = config["OPENAI_API_BASE"]
        api_key = config["OPENAI_API_KEY"]
        # set the deployment name for the model we want to use
        deployment = config["OPENAI_API_GPT_DEPLOYMENT_NAME"]
    
        client = openai.AzureOpenAI(
            base_url=f"{endpoint}/openai/deployments/{deployment}/extensions",
            api_key=api_key,
            api_version="2023-09-01-preview"
        )
    
        response = client.chat.completions.create(
            messages=[{"role": "user", "content": USER_INPUT}],
            model=deployment,
            extra_body={
                "dataSources": [
                    {
                        "type": "AzureCognitiveSearch",
                        "parameters": {
                            "endpoint": ai_search["AZURE_COGNITIVE_SEARCH_ENDPOINT"],
                            "key": ai_search["AZURE_COGNITIVE_SEARCH_KEY"],
                            "indexName": ai_search["AZURE_COGNITIVE_SEARCH_INDEX_NAME"],
                            "fieldsMapping": {
                                "contentFields": parse_multi_columns("content"),
                                "urlField": "url_name",
                                "filepathField": "file_name",
                                "vectorFields": parse_multi_columns("content_vector")
                            },
                            "embeddingDeploymentName": config["OPENAI_API_DEPLOYMENT_NAME"],
                            "query_type":"vectorSimpleHybrid",
                            "inScope": True,
                            "roleInformation": AZURE_OPENAI_SYSTEM_MESSAGE,
                            "topNDocuments": NR_DOCUMENTS,
                            "strictness":  STRICTNESS
                        }
                    }
                ]
            },
            stream=True,
        )
        # Stream the response back to the caller chunk by chunk
        for chunk in response:
            delta = chunk.choices[0].delta
            yield delta
    

    Note: `contentFields` and `vectorFields` must be lists, not strings, because multiple fields are possible here. That is why `parse_multi_columns` converts the setting to a list.
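As a standalone illustration of that last point, here is a minimal sketch (independent of any Azure credentials) of how a pipe- or comma-separated fields setting becomes the list that `fieldsMapping` expects:

```python
def parse_multi_columns(columns: str) -> list:
    # A fields setting may name several index fields, separated by "|" or ","
    if "|" in columns:
        return columns.split("|")
    return columns.split(",")

# A single field still yields a one-element list; multiple fields yield one entry each
print(parse_multi_columns("content"))          # ['content']
print(parse_multi_columns("content|summary"))  # ['content', 'summary']
```

The generator itself can then be consumed with something like `for delta in ask_llm_citation(...): print(delta.content or "", end="")`.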