Can tiktoken be used instead of text-embedding-ada-002 for generating text embeddings for Azure AI Search?


I'm currently tokenizing documents with a text-embedding-ada-002 deployment on Azure. The document contents and their tokens are uploaded to Azure AI Search. We then use a gpt-35-turbo-16k deployment to search these documents (with this API: https://learn.microsoft.com/en-us/azure/ai-services/openai/reference?WT.mc_id=AZ-MVP-5004796#example-request-3).

To reduce operational costs and avoid rate-limiting problems, can we use tiktoken instead of text-embedding-ada-002 to generate the text embeddings? Will the vectors be similar enough that they can be used interchangeably, or are the vectors they produce fundamentally incompatible with what we're using them for, which is Azure AI Search?

The documents we're uploading are in plain text, if that matters.


Solution

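  • Great question! I tested this by running the same string through both Tiktoken and the Azure OpenAI SDK. The results are fundamentally different, so the answer is no: you cannot use Tiktoken to generate embeddings. Tiktoken is only a tokenizer. It splits text into integer token IDs, so the length of its output depends on the input (96 IDs for the sample string below). The text-embedding-ada-002 model returns a fixed-length vector of 1536 floating-point numbers, and that is the kind of value a vector field in Azure AI Search stores and compares.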

    Here's my sample code:

    using Azure;
    using Azure.AI.OpenAI;
    using Tiktoken;
    
    var stringToEncode =
        "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";
    
    // Tokenize locally with Tiktoken: the result is a list of integer token IDs.
    var encoder = Tiktoken.Encoding.TryForModel("text-embedding-ada-002");
    var data1 = encoder!.Encode(stringToEncode); // Produced 96 token IDs.
    
    var openAIClient = new OpenAIClient(new Uri("https://xyz.openai.azure.com/"), new AzureKeyCredential("my-azure-openai-key"));
    var embeddings = openAIClient.GetEmbeddings(new EmbeddingsOptions()
    {
        DeploymentName = "text-embedding-ada-002",
        Input = { stringToEncode }
    });
    var data2 = embeddings.Value.Data[0].Embedding; // Produced a 1536-element vector of floats.
    Console.WriteLine($"Data 1 length: {data1.Count}; Data 2 length: {data2.Length}");
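
    To make the difference in shape concrete, here is a minimal, self-contained sketch that needs no Azure resources. All numbers in it are made up for illustration (they are not real Tiktoken or model output), and CosineSimilarity and ToFloats are hypothetical helpers written for this example, not part of either SDK. It assumes the search side compares vectors with something like cosine similarity, which only works when both vectors have the same number of dimensions.

    ```csharp
    using System;

    // Made-up values for illustration only (not real Tiktoken or model output).
    // Token IDs: one integer per token, so the length follows the text.
    int[] tokenIdsShort = { 101, 202 };
    int[] tokenIdsLong = { 101, 303, 202, 404, 505 };

    // Embeddings: always the same number of floats for a given model
    // (3 here to keep the sketch small; text-embedding-ada-002 returns 1536).
    float[] embeddingA = { 0.10f, 0.70f, 0.20f };
    float[] embeddingB = { 0.15f, 0.65f, 0.25f };

    Console.WriteLine($"Embedding similarity: {CosineSimilarity(embeddingA, embeddingB):F3}");

    try
    {
        // Treating token IDs as a vector fails as soon as two texts differ in length.
        CosineSimilarity(ToFloats(tokenIdsShort), ToFloats(tokenIdsLong));
    }
    catch (ArgumentException ex)
    {
        Console.WriteLine($"Token IDs cannot be compared: {ex.Message}");
    }

    // Hypothetical helper: reinterpret token IDs as floats so they can be passed around like a vector.
    static float[] ToFloats(int[] ids) => Array.ConvertAll(ids, id => (float)id);

    // Hypothetical helper: cosine similarity between two vectors of equal length.
    static double CosineSimilarity(float[] a, float[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same number of dimensions.");

        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }
    ```

    Even if you padded or truncated the token IDs to a fixed length, the numbers would still only be vocabulary indexes. They say which tokens appear, not what the text means, so distances between them would not reflect semantic similarity.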