Tags: azure, azure-ai-search

If I am chunking and vectorizing content into Azure AI Search, do I create a new item for each embedding? Won't I have duplicate documents?


I am trying to set up vector search in Azure AI Search. Some of my documents are a few hundred pages long and I want to vectorize their content, which means I will need to chunk the text. The problem I am having is that after I chunk the text and generate the embeddings, I don't know how to recombine them (or whether I should).

The main issue is that it seems like I need to create a separate entry for each chunk of text, which means my search results are going to return the same document multiple times. Is this accurate?

Is there an example of how to chunk large documents (not from storage) and index them into Azure AI Search in a way that does not put the same document in the index multiple times? I am not using Azure Storage, so it seems I have no option to use integrated vectorization, which apparently handles this problem somehow.


Solution

  • When you do chunking, all the chunks are stored in an array under a single document, like below:

    (screenshot: the chunks appear as an array inside one parent document)

    So it only looks like duplicate documents; everything is still a single document.
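
    For illustration, a parent document with its chunks in a collection field might look like the sketch below (the field names here are assumptions; the chunk sizes follow the split skill settings shown later):

    {
      "id": "doc-001",
      "title": "contract.pdf",
      "pages": [
        "First 2,000-character chunk of the document...",
        "Second chunk, overlapping the previous one by 500 characters..."
      ]
    }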

    To create a separate document for each chunk, you need a secondary index and an index projection defined in the skillset that targets it.
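
    For reference, the secondary index needs fields that match the projection mappings. Here is a minimal sketch (the key field name, its keyword analyzer, and the vector profile names are assumptions; parent_id, chunk, vector, and title mirror the skillset below, and 1536 dimensions matches the ada-002 embedding model):

    {
      "name": "secondary-index",
      "fields": [
        { "name": "chunk_id", "type": "Edm.String", "key": true, "searchable": true, "analyzer": "keyword" },
        { "name": "parent_id", "type": "Edm.String", "filterable": true },
        { "name": "title", "type": "Edm.String", "searchable": true },
        { "name": "chunk", "type": "Edm.String", "searchable": true },
        {
          "name": "vector",
          "type": "Collection(Edm.Single)",
          "searchable": true,
          "dimensions": 1536,
          "vectorSearchProfile": "vector-profile"
        }
      ],
      "vectorSearch": {
        "algorithms": [ { "name": "hnsw-config", "kind": "hnsw" } ],
        "profiles": [ { "name": "vector-profile", "algorithm": "hnsw-config" } ]
      }
    }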

    Below is the skillset definition. Attach it to the indexer, which is targeted at the secondary index.

    {
      "@odata.context": "https://jgsearch.search.windows.net/$metadata#skillsets/$entity",
      "@odata.etag": "\"0x8DC681E216E4055\"",
      "name": "vector-1714372801582-skillset",
      "description": "Skillset to chunk documents and generate embeddings",
      "skills": [
        {
          "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
          "name": "#1",
          "description": null,
          "context": "/document/pages/*",
          "resourceUri": "https://jgsopenai.openai.azure.com",
          "apiKey": "<redacted>",
          "deploymentId": "ada-002",
          "inputs": [
            {
              "name": "text",
              "source": "/document/pages/*"
            }
          ],
          "outputs": [
            {
              "name": "embedding",
              "targetName": "vector"
            }
          ],
          "authIdentity": null
        },
        {
          "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
          "name": "#2",
          "description": "Split skill to chunk documents",
          "context": "/document",
          "defaultLanguageCode": "en",
          "textSplitMode": "pages",
          "maximumPageLength": 2000,
          "pageOverlapLength": 500,
          "maximumPagesToTake": 0,
          "inputs": [
            {
              "name": "text",
              "source": "/document/content"
            }
          ],
          "outputs": [
            {
              "name": "textItems",
              "targetName": "pages"
            }
          ]
        }
      ],
      "cognitiveServices": null,
      "knowledgeStore": null,
      "indexProjections": {
        "selectors": [
          {
            "targetIndexName": "secondary-index",
            "parentKeyFieldName": "parent_id",
            "sourceContext": "/document/pages/*",
            "mappings": [
              {
                "name": "chunk",
                "source": "/document/pages/*",
                "sourceContext": null,
                "inputs": []
              },
              {
                "name": "vector",
                "source": "/document/pages/*/vector",
                "sourceContext": null,
                "inputs": []
              },
              {
                "name": "title",
                "source": "/document/metadata_storage_name",
                "sourceContext": null,
                "inputs": []
              }
            ]
          }
        ],
        "parameters": {
          "projectionMode": "skipIndexingParentDocuments"
        }
      },
      "encryptionKey": null
    }
    

    This results in a separate document for each chunk.

    (screenshot: the secondary index contains one search document per chunk)
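
    For completeness, a minimal indexer that wires the skillset to the secondary index might look like this (the data source name is an assumption, and dataToExtract applies to blob-style data sources; the skillset name matches the definition above):

    {
      "name": "vector-1714372801582-indexer",
      "dataSourceName": "my-data-source",
      "targetIndexName": "secondary-index",
      "skillsetName": "vector-1714372801582-skillset",
      "parameters": {
        "configuration": {
          "dataToExtract": "contentAndMetadata"
        }
      }
    }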

    As you said, integrated vectorization (currently in preview) can also do this for you, without setting all of this up manually.

    You can also achieve this without an index projection: temporarily store the chunked data in storage using a custom Web API, then load it into the secondary index.

    For more about this method, refer to this Stack Overflow answer.
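
    Whichever way you produce the chunks, loading them into the secondary index yourself is just a documents upload: each chunk is pushed as its own document. A sketch of the request, assuming the field names from the index sketch above (the vector is shortened for readability; a real ada-002 embedding has 1,536 values):

    POST https://jgsearch.search.windows.net/indexes/secondary-index/docs/index?api-version=2023-11-01
    Content-Type: application/json
    api-key: <admin-key>

    {
      "value": [
        {
          "@search.action": "mergeOrUpload",
          "chunk_id": "doc-001_chunk_0",
          "parent_id": "doc-001",
          "title": "contract.pdf",
          "chunk": "First 2,000-character chunk of the document...",
          "vector": [0.0071, -0.0142, 0.0213]
        }
      ]
    }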