I have the following code:
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter1.split_documents(data)
vector_search_index = “vector_index”
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=atlas_collection,
    index_name=vector_search_index,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
Every time, results is just an empty array, and I don't understand what I'm doing wrong. Here is the link to the LangChain documentation with the examples I followed. The documents are saved to the database correctly, but I can't search anything in this collection.
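For what it's worth, the documents do show up when I inspect the collection directly. Roughly like this (a quick sanity check; it assumes atlas_collection is the same PyMongo collection passed to from_documents, and that the vectors are stored under LangChain's default "embedding" field):

# Sanity check: confirm the chunks and their embeddings landed in the collection.
print("document count:", atlas_collection.count_documents({}))
sample = atlas_collection.find_one()
if sample is not None:
    # MongoDBAtlasVectorSearch writes the vector to "embedding" by default.
    print("has embedding field:", "embedding" in sample)
    print("embedding length:", len(sample.get("embedding", [])))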
So I was able to get this to work in MongoDB with the following code:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_mongodb import MongoDBAtlasVectorSearch
from langchain_openai import OpenAIEmbeddings
from pymongo import MongoClient

# Load the PDF and split it into chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)

# Connect to the Atlas cluster and pick the target collection.
DB_NAME = "langchain_db"
COLLECTION_NAME = "atlas_collection"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "vector_index"
MONGODB_ATLAS_CLUSTER_URI = os.environ.get("MONGO_DB_ENDPOINT")
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]

# Embed the chunks and store them in the collection.
vector_search = MongoDBAtlasVectorSearch.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection=MONGODB_COLLECTION,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
At this point, I got the same results that you did. Before it would work, though, I had to create the Atlas Vector Search index on the collection and make sure it was named the same as what is specified in ATLAS_VECTOR_SEARCH_INDEX_NAME.
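For reference, here is a rough sketch of how that index could also be created programmatically with PyMongo. The specifics are assumptions on my part: it needs a recent PyMongo that supports vectorSearch index models, it targets the default "embedding" field that MongoDBAtlasVectorSearch writes to, and it uses 1536 dimensions for the default OpenAI embedding model.

from pymongo.operations import SearchIndexModel

# Define an Atlas Vector Search index over the field the embeddings are stored in.
index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "embedding",       # default field used by MongoDBAtlasVectorSearch
                "numDimensions": 1536,     # dimensions of the OpenAI embedding model (assumed)
                "similarity": "cosine",
            }
        ]
    },
    name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    type="vectorSearch",
)
MONGODB_COLLECTION.create_search_index(model=index_model)

Also note that the index takes a little while to build after it is created, so a similarity_search run immediately afterwards can still come back empty.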
FWIW - it was easier for me to do this in Astra DB (I tried it first because I am a DataStax employee):
from langchain_astradb import AstraDBVectorStore

# Load and chunk the same PDF.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()
docs = text_splitter.split_documents(data)

# Astra DB connection details come from environment variables.
atlas_collection = "atlas_collection"
ASTRA_DB_API_ENDPOINT = os.environ.get("ASTRA_DB_API_ENDPOINT")
ASTRA_DB_APPLICATION_TOKEN = os.environ.get("ASTRA_DB_APPLICATION_TOKEN")

# Embed the chunks and store them in an Astra DB collection.
vector_search = AstraDBVectorStore.from_documents(
    documents=docs,
    embedding=OpenAIEmbeddings(disallowed_special=()),
    collection_name=atlas_collection,
    api_endpoint=ASTRA_DB_API_ENDPOINT,
    token=ASTRA_DB_APPLICATION_TOKEN,
)
query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)
print("result: ", results)
Worth noting that Astra DB will create your vector index automatically, based on the dimensions of the embedding model.
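Either way, when you want to sanity-check what the search is actually returning, similarity_search_with_score shows the relevance score alongside each chunk. A quick sketch (k is just the number of results to return; it works the same against the MongoDB store):

# Inspect the top matches and their relevance scores.
results_with_scores = vector_search.similarity_search_with_score(query, k=4)
for doc, score in results_with_scores:
    print(f"{score:.4f}  {doc.page_content[:120]!r}")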