Not a coding question, but a documentation gap that I cannot find addressed anywhere online. When using the LangChain CSVLoader, which column is vectorized by the OpenAI embeddings I am using?
I ask because, with the code below, I vectorized a sample CSV, ran searches (on Pinecone), and consistently got back DISsimilar results. How do I know which column LangChain is actually picking to vectorize?
loader = CSVLoader(file_path=file, metadata_columns=['col2', 'col3', 'col4','col5'])
langchain_docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = text_splitter.split_documents(langchain_docs)
for doc in docs:
    doc.metadata.pop('source')
    doc.metadata.pop('row')
my_index = pc_store.from_documents(docs, embeddings, index_name=PINECONE_INDEX_NAME)
I am assuming the CSVLoader then picks col1 to vectorize. But searches of Pinecone are terrible, which leads me to think some other column is being vectorized.
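You can inspect the docs variable: it is a list of Document objects, each with a content (page_content) property and a metadata property.
The text that gets vectorized is each Document's content. It is not a single column: every column that is not listed in metadata_columns is joined into the content as "column: value" lines. For the details, see the LangChain csv_loader.py source code (line 98):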
content = "\n".join(
    f"{k.strip()}: {v.strip() if v is not None else v}"
    for k, v in row.items()
    if k not in self.metadata_columns
)
metadata = {"source": source, "row": i}
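To see what that join produces without calling LangChain or Pinecone at all, here is a minimal standalone sketch that applies the same expression to one CSV row using only the standard library. The function name build_page_content, the sample CSV, and the column named extra are all made up for illustration; this mimics the snippet above and is not the loader itself.

```python
import csv
import io


def build_page_content(row, metadata_columns):
    """Join every non-metadata column as "name: value" lines.

    Hypothetical helper that mirrors the join expression quoted above.
    """
    return "\n".join(
        f"{k.strip()}: {v.strip() if v is not None else v}"
        for k, v in row.items()
        if k not in metadata_columns
    )


# Made-up sample data: one header row and one data row.
sample_csv = "col1,col2,col3,extra\nhello,a,b,world\n"
metadata_columns = ["col2", "col3"]

rows = list(csv.DictReader(io.StringIO(sample_csv)))
page_content = build_page_content(rows[0], metadata_columns)

# This string is what would be handed to the embedding model.
print(page_content)
```

In this sketch both col1 and extra end up in the text, each prefixed with its column name. So if the real CSV has any columns beyond col1 through col5, they are embedded along with col1. Printing docs[0].page_content in the original code should show exactly what is being sent to the embeddings.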