I am trying to setup a PostgreSQL as vector database by following this guide: https://www.timescale.com/blog/postgresql-as-a-vector-database-create-store-and-query-openai-embeddings-with-pgvector/
However, I am stuck at this step:
#Batch insert embeddings and metadata from dataframe into PostgreSQL database
register_vector(conn)
cur = conn.cursor()
# Prepare the list of tuples to insert
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
# Use execute_values to perform batch insertion
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s", data_list)
# Commit after we insert all embeddings
conn.commit()
The error I am getting is
Traceback (most recent call last):
File "~/vector-cookbook-main/openai_pgvector_helloworld/script.py", line 166, in <module>
execute_values(cur, "INSERT INTO embeddings (title, url, content, tokens, embedding) VALUES %s;", data_list)
File "~/vector-cookbook-main/openai_pgvector_helloworld/venv/lib/python3.9/site-packages/psycopg2/extras.py", line 1296, in execute_values
parts.append(cur.mogrify(template, args))
File "~/vector-cookbook-main/openai_pgvector_helloworld/venv/lib/python3.9/site-packages/pgvector/psycopg2/__init__.py", line 14, in getquoted
return adapt(to_db(self._vector)).getquoted()
File "~/vector-cookbook-main/openai_pgvector_helloworld/venv/lib/python3.9/site-packages/pgvector/utils/__init__.py", line 27, in to_db
raise ValueError('expected ndim to be 1')
ValueError: expected ndim to be 1
This error occurs in the execute_values call, and I am not quite sure what it means.
I am very new to vector databases and would appreciate any help in resolving this.
Thanks.
As Dunes said, you may have an array nested inside another array where a simple 1-D array is expected. At least, that's what happened to me when I followed the guide from timescale.com (a very good guide, by the way, though it was written for OpenAI rather than Azure OpenAI, so a few slight changes are needed).
For me, what worked was removing the extra np.array() call:
Before:
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), np.array(row['embeddings'])) for index, row in df_new.iterrows()]
After:
data_list = [(row['title'], row['url'], row['content'], int(row['tokens']), row['embeddings']) for index, row in df_new.iterrows()]
The function that computes the embedding already returns the vector as an array. By wrapping it in np.array(row['embeddings']), I was creating an array of arrays, which is not what pgvector expects.
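To see why pgvector raises "expected ndim to be 1", here is a minimal sketch. The literal values below are made up for illustration; the point is that pgvector's adapter only accepts 1-D arrays, so a cell holding a nested list (or an extra np.array() wrap around an existing array) produces a 2-D array and triggers the error:

```python
import numpy as np

# Hypothetical embedding stored one level too deep,
# e.g. the whole `data` list instead of `data[0].embedding`.
nested = [[0.1, 0.2, 0.3]]

arr = np.array(nested)
print(arr.ndim)  # 2 -> pgvector raises ValueError('expected ndim to be 1')

# Unwrapping the inner list yields the 1-D vector pgvector accepts.
flat = np.array(nested[0])
print(flat.ndim)  # 1
```

Checking `np.array(row['embeddings']).ndim` on a single row of the DataFrame is a quick way to confirm which case you are in before the batch insert.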