I am trying to setup a PostgreSQL as vector database by following this guide: https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/
However, I am stuck at this step:
#Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()
The error I am getting is
Traceback (most recent call last):
File "~/vector-cookbook-main/openai_pgvector_helloworld/script.py", line 166, in <module>
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s;", data_list)
File "~/vector-cookbook-main/openai_pgvector_helloworld/venv/lib/python3.9/site-packages/psycopg2/extras.py", line 1296, in execute_values
parts.append(cur.mogrify(template, args))
File "~/vector-cookbook-main/openai_pgvector_helloworld/venv/lib/python3.9/site-packages/pgvector/psycopg2/__init__.py", line 14, in getquoted
return adapt(to_db(self._vector)).getquoted()
File "~/vector-cookbook-main/openai_pgvector_helloworld/venv/lib/python3.9/site-packages/pgvector/utils/__init__.py", line 27, in to_db
raise ValueError('expected ndim to be 1')
ValueError: expected ndim to be 1
This error occurs in the execute_values call, and I am not quite sure what it means.
I am very new to vector databases and would appreciate any help in resolving this.
Thanks.
As Dunes said, you may have an array nested inside another array where a simple 1-D array is expected. At least, that's what happened to me when I followed the guide from timescale.com (a very good guide, by the way, though it was written for OpenAI rather than Azure OpenAI, so a few slight changes are needed).
For me, what worked was removing the extra np.array() call:
Before:
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
After:
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), row['embeddings']) for index, row in df_new.iterrows()]
The function that computes the embedding already returns the vector as an array. By wrapping it in np.array(row['embeddings']), I was creating an array of arrays, which is not what pgvector expects.
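To see why pgvector raises "expected ndim to be 1", here is a minimal sketch. The literal values below are made up for illustration; the point is that pgvector's adapter only accepts 1-D arrays, so a cell holding a nested list (or an extra np.array() wrap around an existing array) produces a 2-D array and triggers the error:

```python
import numpy as np

# Hypothetical embedding stored one level too deep,
# e.g. the whole `data` list instead of `data[0].embedding`.
nested = [[0.1, 0.2, 0.3]]

arr = np.array(nested)
print(arr.ndim)  # 2 -> pgvector raises ValueError('expected ndim to be 1')

# Unwrapping the inner list yields the 1-D vector pgvector accepts.
flat = np.array(nested[0])
print(flat.ndim)  # 1
```

Checking `np.array(row['embeddings']).ndim` on a single row of the DataFrame is a quick way to confirm which case you are in before the batch insert.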