python loader langchain pinecone

Langchain CSVLoader


This is less a coding question than a documentation gap that I have not seen addressed anywhere online. When using the LangChain CSVLoader, which column is vectorized by the OpenAI embeddings I am using?

I ask because, with the code below, I vectorized a sample CSV, ran searches against Pinecone, and consistently got back dissimilar results. How do I know which column LangChain actually selects to vectorize?

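# Import paths vary by LangChain version; these assume the split packages.
from langchain_community.document_loaders import CSVLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# col2..col5 are stored as metadata on each Document.
loader = CSVLoader(file_path=file, metadata_columns=['col2', 'col3', 'col4', 'col5'])
langchain_docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = text_splitter.split_documents(langchain_docs)
for doc in docs:
    # Drop the keys the loader adds so they are not stored in Pinecone.
    doc.metadata.pop('source')
    doc.metadata.pop('row')
# pc_store, embeddings, and PINECONE_INDEX_NAME are defined elsewhere.
my_index = pc_store.from_documents(docs, embeddings, index_name=PINECONE_INDEX_NAME)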

I assumed CSVLoader would then pick col1 to vectorize. But the Pinecone search results are terrible, which leads me to think some other column is being vectorized.


Solution

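  • Inspect the docs variable: it is a list of Document objects, each with a page_content and a metadata property.

    The embedding is computed from each Document's page_content. To see exactly how that content is built, refer to the langchain csv_loader.py source code (line 98). No single column is picked: every column not listed in metadata_columns is joined into the content as "column: value" lines.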

    content = "\n".join(
        f"{k.strip()}: {v.strip() if v is not None else v}"
        for k, v in row.items()
        if k not in self.metadata_columns
    )
    metadata = {"source": source, "row": i}
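
To see what text ends up being embedded without running LangChain or Pinecone, the join expression quoted above can be reproduced with the standard library alone. This is only a sketch: build_page_content, sample_csv, and the sample rows are made up for illustration, and the function mirrors the quoted snippet, not the whole loader.

```python
import csv
import io


def build_page_content(row, metadata_columns):
    """Mirror the join expression quoted from csv_loader.py for one CSV row."""
    return "\n".join(
        f"{k.strip()}: {v.strip() if v is not None else v}"
        for k, v in row.items()
        if k not in metadata_columns
    )


# Hypothetical sample data standing in for the real CSV file.
sample_csv = io.StringIO(
    "col1,col2,col3\n"
    "red apple,fruit,0.50\n"
    "green pear,fruit,0.75\n"
)
metadata_columns = ["col2", "col3"]

contents = [
    build_page_content(row, metadata_columns)
    for row in csv.DictReader(sample_csv)
]
print(contents[0])  # col1: red apple

# With no metadata columns, every column is joined into the embedded text.
all_columns = build_page_content({"col1": "a", "col2": "b"}, [])
print(all_columns)
```

Note that the embedded text includes the "col1: " prefix, and that any column left out of metadata_columns is embedded too. Printing docs[0].page_content in the original script should show the same shape for the real data.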