What's the best index distance metric for my Pinecone vector database, filled with a series of similarly formatted Markdown files?


As the title states, I'm wondering if I can get more insight into choosing a metric for my Pinecone database index. Pinecone currently offers three options to choose from. From their documentation, they are:

  • euclidean - This is used to calculate the distance between two data points in a plane. It is one of the most commonly used distance metrics. For an example, see our image similarity search example. When you use metric='euclidean', the most similar results are those with the lowest score.
  • cosine - This is often used to find similarities between different documents. The advantage is that the scores are normalized to [-1,1] range.
  • dotproduct - This is used to multiply two vectors. You can use it to tell us how similar the two vectors are. The more positive the answer is, the closer the two vectors are in terms of their directions.
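
For intuition, here is how the three metrics compare on a pair of toy vectors (the vectors are made up for illustration, and this is plain Python rather than anything Pinecone runs internally):

```python
import math

def dot(a, b):
    # Raw dot product: the more positive, the more aligned the directions
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    # Straight-line distance: lower score means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    # Dot product of length-normalized vectors, always in [-1, 1]
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the magnitude

print(euclidean(a, b))  # nonzero: the points sit apart in space
print(cosine(a, b))     # ~1.0: the directions are identical
```

Note how cosine ignores magnitude entirely: `b` is just `a` scaled up, so their cosine similarity is 1 even though their Euclidean distance is not 0.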

In my case, I have generated human-like descriptions of a fairly similar, repeating dataset in Markdown files. However, I'm wondering if this just adds noise to my data, since the only thing changing in each file is (mainly) the numbers. Imagine these example documents:

Document 1:

```
# August 11th, 2023
Today we sold 5 apples and 3 oranges.
```

Document 2:

```
# August 12th, 2023
Today we sold 2 apples and 6 oranges.
```

Document 3:

```
# August 13th, 2023
Today we sold 0 apples and 1 orange.
```

and so on...

Then, you could imagine queries like "how many apples did we sell on August 12th, 2023?" I thought this would be "simple enough" for a custom embedding, but the results are far from correct most of the time! I am currently using the cosine index.
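
To illustrate why these documents are hard to tell apart, here is a rough sketch using toy bag-of-words vectors (not real embeddings, so the absolute numbers are only indicative) scored with cosine similarity:

```python
import math
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase, strip common punctuation, split on whitespace
    return Counter(text.lower().replace(",", "").replace(".", "").replace("?", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

docs = [
    "# August 11th, 2023\nToday we sold 5 apples and 3 oranges.",
    "# August 12th, 2023\nToday we sold 2 apples and 6 oranges.",
    "# August 13th, 2023\nToday we sold 0 apples and 1 orange.",
]
query = "how many apples did we sell on August 12th, 2023?"

q = tokenize(query)
sims = [cosine(q, tokenize(d)) for d in docs]
print(sims)  # the correct document (index 1) wins, but only by a small margin
```

Because the documents share almost all their vocabulary and differ mainly in the numbers, the scores cluster close together, which matches the "far from correct most of the time" behavior: a small amount of embedding noise is enough to flip the ranking.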

I have a variety of questions that I haven't been able to find clear answers to:

First, for this type of data, which index distance metric makes the most sense?

Second, am I overcomplicating this problem and I should just leave the dataset in a raw format (i.e. JSON)?

Third, is it possible to create a sort of 'summary' file that I could weight more heavily than the 'daily' documents in my queries? Or is the whole point of RAG that I DON'T need to weight the documents separately, and I can just 'trust' the initial retrieval? Such a summary file would include a variety of statistics that are likely to be queried often (in my example, perhaps the total YTD sales of apples and oranges, and the average daily sales of each).


Solution

  • You should use the same similarity metric used to train the model that created the embeddings.

    For example, if you're using OpenAI embeddings (any of their embedding models so far), you should use cosine similarity.
