azure, azure-openai, azure-ai-search

Add dynamic json document to Azure AI Search without defining field names explicitly


I have been looking at Azure AI Search as a way to store dynamic JSON documents for vector search.

I understand that we can define a schema in AI Search for structured JSON. But I wonder whether there is a way to store JSON payloads that do not share a unified structure. For example, the index would store JSON files 1, 2, and 3, each with a different structure, inside a content field (ideally as a JSON type, not as a string type).

I have also tried simply dumping the JSON as a string and storing it in the content field, but I am not sure whether there is a better approach that would work better for vector search later on.



Solution

  • The possible approaches are listed below.

    1. Serialize the whole JSON content as a string and create a vector field (content_vector) from it. A vector search then covers all of the JSON content.

    Sample code to preprocess the data and upload it to the index:

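    import json

    import openai
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    # Configuration for the legacy (pre-1.0) openai SDK against Azure OpenAI
    openai.api_type = "azure"
    openai.api_key = "YOUR_API_KEY"
    openai.api_base = "https://YOUR_RESOURCE_NAME.openai.azure.com"
    openai.api_version = "2024-06-01"

    # Placeholder endpoint, index name, and key; replace with your own values
    search_client = SearchClient(
        endpoint="https://YOUR_SEARCH_SERVICE.search.windows.net",
        index_name="YOUR_INDEX_NAME",
        credential=AzureKeyCredential("YOUR_SEARCH_API_KEY"),
    )

    documents = []
    files = ["cosmosdata.json", "file2.json"]  # ... add the remaining files

    for i, file in enumerate(files, start=1):
        # Load the JSON file and serialize it back into a single string
        with open(file) as f:
            data = json.dumps(json.load(f))

        # Embed the whole JSON string
        response = openai.Embedding.create(
            input=data,
            engine="YOUR_DEPLOYMENT_NAME",
        )
        embeddings = response["data"][0]["embedding"]

        obj = {
            "id": str(i),  # the key field must be a unique string per document
            "content": data,
            "content_vector": embeddings,
            # Add required fields from metadata
        }
        documents.append(obj)

    # Pass the list itself, not a list wrapped in another list
    result = search_client.upload_documents(documents=documents)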
    

    Adapt the code above to your data. See the Azure documentation on generating embeddings and creating an index for the remaining setup.

    2. If the JSON content is only partially dynamic, that is, some fields are common to all records and you need to filter on them, create a schema for the common fields only and keep the remaining content as a string.

    3. You can generate embeddings from only the relevant parts of the JSON content to improve search quality. To build an index this way, pre-process the documents before uploading them to the index.

    For example, consider the two JSON documents below.

        [
            {
                "id": "2",
                "Color2": "WHITE-CREAM"
            },
            {
                "id": "10",
                "Color3": "WHITE"
            }
        ]
    

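    Here, the id field is common to both documents, while the color field differs (Color2 in one, Color3 in the other).

    In this case, create an index with the fields id and uncommon_content (holding the color data), plus a vector field generated from uncommon_content. That way only the relevant parts are included in the embeddings.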
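The pre-processing step described above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper name build_search_document, the vector field name uncommon_content_vector, and the fake_embed stand-in are all assumptions made for the example. In practice you would pass a function that calls your Azure OpenAI embedding deployment instead of fake_embed.

```python
import json
from typing import Callable, Dict, List, Sequence


def build_search_document(
    record: Dict,
    embed: Callable[[str], List[float]],
    common_fields: Sequence[str] = ("id",),  # assumption: fields shared by all records
) -> Dict:
    """Split a record into common schema fields and a serialized remainder."""
    # Fields that every record has become real index fields (filterable).
    doc = {k: record[k] for k in common_fields if k in record}
    # Everything else is kept as one JSON string with a stable key order.
    uncommon = {k: v for k, v in record.items() if k not in common_fields}
    uncommon_text = json.dumps(uncommon, sort_keys=True, ensure_ascii=False)
    doc["uncommon_content"] = uncommon_text
    # Only the uncommon part is embedded (hypothetical vector field name).
    doc["uncommon_content_vector"] = embed(uncommon_text)
    return doc


def fake_embed(text: str) -> List[float]:
    # Stand-in for a real embedding call; returns a tiny deterministic vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


records = [
    {"id": "2", "Color2": "WHITE-CREAM"},
    {"id": "10", "Color3": "WHITE"},
]
docs = [build_search_document(r, fake_embed) for r in records]
```

Each resulting dictionary has the same three keys regardless of the shape of the input record, so the whole list can be passed to upload_documents against an index that defines id, uncommon_content, and the vector field.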