Tags: azure, azure-cognitive-search, azure-openai

Impact of a question mark (?) in vector search


We are using Azure Cognitive Search as a vector database, and we generate embeddings for both the query and the documents with the Azure OpenAI Ada 002 model (text-embedding-ada-002), following the RAG pattern.

We observe different results for the same question with and without a question mark (?):

  1. What is Maize ?
  2. What is Maize
  3. What is Maize?

Questions:

  1. What is the impact of a '?' in vector search, especially in Azure Cognitive Search?
  2. What is the standard way of handling it?
  3. Is vector search in Azure Cognitive Search case-sensitive?

Thanks -Nen


Solution

  • Embeddings are not a character-by-character representation of the input; they are a mapping into a continuous vector space. It is therefore expected that different inputs, however small the difference, produce different vectors and thus may pull different results during search.

    They should be close since they are conceptually the same, but they aren't going to be the same vector.
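
    If the three variants in the question should always retrieve the same results, one option is to normalize the query text before it is embedded, so that all variants become the same input string. The sketch below is only an illustration of that idea: `normalize_query` and its `lowercase` flag are hypothetical names, not something Azure Cognitive Search or the OpenAI API provides, and whether stripping punctuation or case is acceptable depends on your data.

```python
import re

def normalize_query(text: str, lowercase: bool = False) -> str:
    """Hypothetical helper: collapse whitespace and drop trailing '?', '!' or '.'."""
    # Collapse any run of whitespace into a single space and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Remove trailing punctuation together with any spaces in front of it.
    text = re.sub(r"[\s?!.]+$", "", text)
    # Lowercasing is optional because case can carry meaning (e.g. acronyms).
    return text.lower() if lowercase else text

variants = ["What is Maize ?", "What is Maize", "What is Maize?"]
normalized = {normalize_query(v) for v in variants}
print(normalized)
```

    All three variants collapse to one string, so only one query embedding is ever requested for them. This does nothing about differences between the query and how the documents were embedded.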

    Here are two ways of digging into this a bit more: first, comparing embeddings directly; second, looking at the tokenization side.

    Comparing embeddings

    Using the embeddings API you can look at the distance between vectors directly, to separate them from the search/retrieval details:

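    # get_embedding and cosine_similarity are helpers from the pre-1.0 openai Python package
    from openai.embeddings_utils import get_embedding, cosine_similarity

    # engine is the Azure OpenAI deployment name (here a deployment called "embedding")
    a = get_embedding("What is Maize?", engine="embedding")
    b = get_embedding("What is Maize ?", engine="embedding")
    c = get_embedding("What is Maize", engine="embedding")
    d = get_embedding("What is maize?", engine="embedding")
    e = get_embedding("What is corn?", engine="embedding")
    f = get_embedding("What is spinach?", engine="embedding")

    # Punctuation and spacing variants of the same question
    print("'?' vs ' ?'", cosine_similarity(a, b))
    print("'?' vs ''", cosine_similarity(a, c))
    print("' ?' vs ''", cosine_similarity(b, c))
    # Case variant, a synonym, and an unrelated word for comparison
    print("Maize vs maize", cosine_similarity(a, d))
    print("maize vs corn", cosine_similarity(a, e))
    print("maize vs spinach", cosine_similarity(a, f))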
    

    I get:

    '?' vs ' ?' 0.9789760561431554
    '?' vs '' 0.9726684993796191
    ' ?' vs '' 0.9646235430443343
    Maize vs maize 0.982432778637022
    maize vs corn 0.9262367100603125
    maize vs spinach 0.8305263015872602
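
    For reference, the scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch of that formula follows; the function name `cosine` and the toy 2-D vectors are illustrative assumptions, not the library's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings.
print(cosine([1.0, 0.0], [1.0, 0.0]))  # identical direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

    In the results above even "maize vs spinach" scores 0.83, so the ordering of the scores says more than their absolute values.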
    

    OpenAI tokenizer

    You can try the tokenizer here: https://platform.openai.com/tokenizer

    Two observations you can make: a) the "?" is treated as a token (i.e. it is not ignored or stripped), and b) different casing produces different tokens.
