Tags: azure, indexing, azure-ai, rag, vector-search

Issue with Azure AI Search: Mismatch in Vector Dimensions When Indexing Chunked Documents


I’m currently building a Retrieval-Augmented Generation (RAG) system using Azure AI Search, and I've run into a problem with my index/indexer and skillset when handling chunked documents.

Overview of My Setup:

  • I import data via SharePoint (which works fine).
  • I chunk my files into pages.
  • Each page is embedded using the ada-002 model.
  • After embedding, I feed the resulting vectors into my index for search.

Issue:

When I try to feed my indexer with the chunked pages, I occasionally encounter a dimension mismatch error. Specifically, I receive the following error:

There's a mismatch in vector dimensions. The vector field 'content_embeddings', with dimension of '1536',
expects a length of '1536'. However, the provided vector has a length of '3072'. 
Please ensure that the vector length matches the expected length of the vector field.

Observations:

  • When the documents are not chunked (i.e., the document is smaller than the chunk limit), they are indexed successfully without issues.
  • The problem arises specifically when the documents are chunked.

Troubleshooting Steps I’ve Taken:

  1. Output Validation: I log the dimensions of the vectors produced after embedding and before indexing. They are usually correctly sized at 1536, but the chunked vectors sometimes come out oversized (see the validation sketch after this list).
  2. Chunking Logic: I’ve ensured that my chunking process doesn’t combine multiple chunks or overlap content, but the issue still persists.
  3. Indexer Configuration: I reviewed the indexer setup to ensure it correctly maps to the content_embeddings field.
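
For illustration, here is a minimal sketch of the per-chunk dimension check from step 1. It is plain Python with placeholder data shapes (the pages list and the content_embeddings key are assumptions, not the actual pipeline structure):

  # Check that every chunk embedding has exactly the dimension the index expects.
  EXPECTED_DIM = 1536  # dimension declared on the content_embeddings index field

  def validate_embeddings(pages):
      """Return a description of every page whose embedding is mis-sized."""
      problems = []
      for i, page in enumerate(pages):
          vector = page.get("content_embeddings", [])
          if len(vector) != EXPECTED_DIM:
              problems.append(f"page {i}: expected {EXPECTED_DIM} floats, got {len(vector)}")
      return problems

  # Two 1536-float vectors accidentally merged into one page entry would
  # surface here as a single 3072-float vector.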

Assumption:

I suspect that two 1536-dimension vectors are being concatenated, which would explain the oversized length exactly (2 × 1536 = 3072). However, I'm not sure where this happens or how to prevent it.

My Question:

  • What could be causing this dimension mismatch specifically when I attempt to index the chunked embeddings?
  • Are there any specific configurations or best practices in Azure AI Search regarding embedding vectors that I should consider?

Code Snippets:

Skillset Configuration:

  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "SplitSkill",
      "description": "A skill that splits text into chunks",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "unit": "azureOpenAITokens",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ],
      "azureOpenAITokenizerParameters": {
        "encoderModelName": "cl100k_base",
        "allowedSpecialTokens": [
          "[START]",
          "[END]"
        ]
      }
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "ContentEmbeddingSkill",
      "description": "Connects to Azure OpenAI deployed embedding model to generate embeddings from content.",
      "context": "/document/pages/*",
      "resourceUri": "https://xxxx.openai.azure.com",
      "apiKey": "<redacted>",
      "deploymentId": "text-embedding-ada-002",
      "dimensions": 1536,
      "modelName": "text-embedding-ada-002",
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "content_embeddings"
        }
      ],
      "authIdentity": null
    }
  ]

Indexer Configuration:

{
  "@odata.context": "xxxxxxxxxxx",
  "@odata.etag": "xxxxxxxxxxx",
  "name": "xxxxxxxxxxx-vector",
  "description": null,
  "dataSourceName": "sharepoint-datasource",
  "skillsetName": "contentembedding",
  "targetIndexName": "sharepoint-index",
  "disabled": null,
  "schedule": null,
  "parameters": {
    "batchSize": 10,
    "maxFailedItems": 100,
    "maxFailedItemsPerBatch": null,
    "base64EncodeKeys": null,
    "configuration": {
      "indexedFileNameExtensions": ".csv, .docx, .pptx,.txt,.html,.pdf",
      "excludedFileNameExtensions": ".png, .jpg, .gif",
      "dataToExtract": "contentAndMetadata"
    }
  },
  "fieldMappings": [
    {
      "sourceFieldName": "content",
      "targetFieldName": "content",
      "mappingFunction": null
    }
  ],
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/pages",
      "targetFieldName": "pages",
      "mappingFunction": null
    },
    {
      "sourceFieldName": "/document/pages/*/content_embeddings/*",
      "targetFieldName": "content_embeddings",
      "mappingFunction": null
    }
  ],
  "cache": null,
  "encryptionKey": null
}

I appreciate any insights or suggestions on how to resolve this issue!


Solution

  • What could be causing this dimension mismatch specifically when I attempt to index the chunked embeddings?

    If multiple chunk embeddings (1536-dimension vectors each) are unintentionally concatenated before being passed to the indexer, the resulting vector becomes too large: exactly 3072 for two concatenated vectors. This happens when a document is split into multiple chunks and their embeddings are merged into a single field value instead of being handled individually, as the short demonstration below shows.
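
    A quick demonstration of the arithmetic in plain Python (no Azure dependencies):

    # Two correctly sized chunk embeddings...
    chunk_1 = [0.0] * 1536
    chunk_2 = [0.0] * 1536

    # ...accidentally flattened into one field value:
    merged = chunk_1 + chunk_2
    print(len(merged))  # 3072 -> exactly the length reported in the error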

    • Check that the embedding skill operates on one chunk at a time and outputs a single vector per chunk, and avoid concatenating embeddings from different chunks. With "context" set to "/document/pages/*", the skill runs once per page:
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "ContentEmbeddingSkill",
      "context": "/document/pages/*",
      "dimensions": 1536,
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "content_embeddings"
        }
      ]
    }
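
    With this per-page context, the enrichment tree holds one content_embeddings vector under each /document/pages/* node, so every embedding remains a separate 1536-float array instead of being merged.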
    
    • Configure the indexer's outputFieldMappings to map each chunk's embedding to the content_embeddings field without a trailing /* on the source path. The trailing /* in the original mapping ("/document/pages/*/content_embeddings/*") enumerates the individual floats, flattening every chunk's values into one long array:
    {
      "dataSourceName": "your-datasource",
      "skillsetName": "your-skillset",
      "targetIndexName": "your-index",
      "outputFieldMappings": [
        {
          "sourceFieldName": "/document/pages/*/content_embeddings",
          "targetFieldName": "content_embeddings"
        }
      ]
    }
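
    After rerunning the indexer, a quick way to confirm the fix is to read back a few documents and check the stored vector lengths. Here is a minimal sketch using the azure-search-documents Python SDK; the endpoint, key, and index name are placeholders, and it assumes the content_embeddings field is marked retrievable in the index definition:

    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    # Placeholders - substitute your own service details.
    client = SearchClient(
        endpoint="https://<your-service>.search.windows.net",
        index_name="sharepoint-index",
        credential=AzureKeyCredential("<admin-or-query-key>"),
    )

    # Fetch a few documents and verify each stored vector is 1536 floats.
    for doc in client.search(search_text="*", select=["content_embeddings"], top=5):
        vector = doc.get("content_embeddings") or []
        assert len(vector) == 1536, f"unexpected vector length: {len(vector)}"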
    
