I have the following code:
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter1.split_documents(data)
vector_search_index = “vector_index”
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=atlas_collection,
    index_name=vector_search_index,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
Every time, results is just an empty array, and I don't understand what I'm doing wrong. Here is the link to the LangChain documentation with the examples I followed. The documents are saved to the database correctly, but I can't search anything in this collection.
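For what it's worth, the documents do show up when I inspect the collection directly. Roughly like this (a quick sanity check; it assumes atlas_collection is the same PyMongo collection passed to from_documents, and that the vectors are stored under LangChain's default "embedding" field):

# Sanity check: confirm the chunks and their embeddings landed in the collection.
print("document count:", atlas_collection.count_documents({}))
sample = atlas_collection.find_one()
if sample is not None:
    # MongoDBAtlasVectorSearch writes the vector to "embedding" by default.
    print("has embedding field:", "embedding" in sample)
    print("embedding length:", len(sample.get("embedding", [])))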
So I was able to get this to work in MongoDB with the following code:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient

# Load the PDF and split it into chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)

# Connect to the Atlas cluster and pick the target collection.
DB_NAME = "langchain_db"
COLLECTION_NAME = "atlas_collection"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGO_DB_ENDPOINT")
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

# Embed the chunks and store them in the collection.
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
At this point, I got the same results that you did. Before it would work, though, I had to create the Atlas Vector Search index on the collection and make sure it was named the same as what is specified in ATLAS_VECTOR_SEARCH_INDEX_NAME.
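For reference, here is a rough sketch of how that index could also be created programmatically with PyMongo. The specifics are assumptions on my part: it needs a recent PyMongo that supports vectorSearch index models, it targets the default "embedding" field that MongoDBAtlasVectorSearch writes to, and it uses 1536 dimensions for the default OpenAI embedding model.

from pymongo.operations import SearchIndexModel

# Define an Atlas Vector Search index over the field the embeddings are stored in.
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",       # default field used by MongoDBAtlasVectorSearch
                "numDimensions": 1536,     # dimensions of the OpenAI embedding model (assumed)
                "similarity": "cosine",
            }
        ]
    },
    name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    type="vectorSearch",
)
MONGODB_COLLECTION.create_search_index(model=index_model)

Also note that the index takes a little while to build after it is created, so a similarity_search run immediately afterwards can still come back empty.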
FWIW - it was easier for me to do this in Astra DB (I tried it first because I am a DataStax employee):
from langchain_astradb import AstraDBVectorStore

# Load and chunk the same PDF.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)

# Astra DB connection details come from environment variables.
atlas_collection = "atlas_collection"
ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")

# Embed the chunks and store them in an Astra DB collection.
vector_search = AstraDBVectorStore.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection_name=atlas_collection,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
Worth noting that Astra DB will create your vector index automatically, based on the dimensions of the embedding model.
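Either way, when you want to sanity-check what the search is actually returning, similarity_search_with_score shows the relevance score alongside each chunk. A quick sketch (k is just the number of results to return; it works the same against the MongoDB store):

# Inspect the top matches and their relevance scores.
results_with_scores = vector_search.similarity_search_with_score(query, k=4)
for doc, score in results_with_scores:
    print(f"{score:.4f}  {doc.page_content[:120]!r}")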