Tags: python, openai-api, langchain, chatgpt-api, py-langchain

LangChain ConversationalRetrievalChain with JSONLoader


I modified the data loader in https://github.com/techleadhd/chatgpt-retrieval so that its ConversationalRetrievalChain accepts JSON data.

I created a dummy JSON file whose structure matches the one described in the LangChain documentation.

{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "I had a terrible experience at this hotel. The room was dirty and the staff was rude."},
    {"text": "Highly recommended! The hotel has a beautiful view and the staff is friendly."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "I absolutely loved my stay at this hotel. The amenities were top-notch."},
    {"text": "Disappointing experience. The hotel was overpriced for the quality provided."},
    {"text": "The hotel exceeded my expectations. The room was spacious and clean."},
    {"text": "Avoid this hotel at all costs! The customer service was horrendous."},
    {"text": "Fantastic hotel with a great location. I would definitely stay here again."},
    {"text": "Not a bad hotel, but there are better options available in the area."}
  ]
}
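As a sanity check on the file itself, here is a minimal stdlib-only sketch of what jq_schema=".reviews[]" combined with content_key='text' is expected to pull out: one string per element of the "reviews" array. It does not use LangChain or jq; the helper name extract_texts and the shortened three-review sample are assumptions made for illustration.

```python
import json

# Abbreviated copy of the sample file above (three of the ten reviews).
SAMPLE = """
{
  "reviews": [
    {"text": "Great hotel, excellent service and comfortable rooms."},
    {"text": "Average hotel. The room was okay, but nothing special."},
    {"text": "Not a bad hotel, but there are better options available in the area."}
  ]
}
"""


def extract_texts(raw_json, list_key="reviews", content_key="text"):
    """Hypothetical helper: return one string per element of the array,
    roughly what `.reviews[]` plus content_key='text' selects."""
    data = json.loads(raw_json)
    return [item[content_key] for item in data[list_key]]


texts = extract_texts(SAMPLE)
print(len(texts))   # number of extracted review texts
print(texts[0])     # first extracted review
```

Run against the full file, a check like this should yield ten strings. If the loader produces the same ten documents, the file structure and the schema are probably not the cause of the wrong answers.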

The code is:

import os
import sys

import openai
from langchain.chains import ConversationalRetrievalChain, RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.indexes import VectorstoreIndexCreator
from langchain.indexes.vectorstore import VectorStoreIndexWrapper
from langchain.llms import OpenAI
from langchain.vectorstores import Chroma
from langchain.document_loaders import JSONLoader

os.environ["OPENAI_API_KEY"] = 'YOUR_API_KEY_HERE'

# Enable to save to disk & reuse the model (for repeated queries on the same data)
PERSIST = False

query = None
if len(sys.argv) > 1:
  query = sys.argv[1]


if PERSIST and os.path.exists("persist"):
  print("Reusing index...\n")
  vectorstore = Chroma(persist_directory="persist", embedding_function=OpenAIEmbeddings())
  index = VectorStoreIndexWrapper(vectorstore=vectorstore)
else:

  # Emit one document per element of "reviews", using its "text" field as the content
  loader = JSONLoader("data/review.json", jq_schema=".reviews[]", content_key='text')

  if PERSIST:
    index = VectorstoreIndexCreator(vectorstore_kwargs={"persist_directory":"persist"}).from_loaders([loader])
  else:
    index = VectorstoreIndexCreator().from_loaders([loader])

chain = ConversationalRetrievalChain.from_llm(
  llm=ChatOpenAI(model="gpt-3.5-turbo"),
  retriever=index.vectorstore.as_retriever()
)

chat_history = []
while True:
  if not query:
    query = input("Prompt: ")
  if query in ['quit', 'q', 'exit']:
    sys.exit()
  result = chain({"question": query, "chat_history": chat_history})
  print(result['answer'])

  chat_history.append((query, result['answer']))
  query = None

Some examples of results are:

Prompt: can you summarize the data?
Sure! Based on the provided feedback, we have a mix of opinions about the hotels. One person found it to be an average hotel with nothing special, another person had a great experience with excellent service and comfortable rooms, another person was pleasantly surprised by a hotel that exceeded their expectations with spacious and clean rooms, and finally, someone had a disappointing experience with an overpriced hotel that didn't meet their expectations in terms of quality.

Prompt: how many feedbacks present in the data ?
There are four feedbacks present in the data.

Prompt: how many of them are positive (sentiment)?
There are four positive feedbacks present in the data.

Prompt: how many of them are negative?
There are three negative feedbacks present in the data.

Prompt: how many of them are neutral?
Two of the feedbacks are neutral.

Prompt: what is the last review you can see?
The most recent review I can see is: "The hotel exceeded my expectations. The room was spacious and clean."

Prompt: what is the first review you can see?
The first review I can see is "Highly recommended! The hotel has a beautiful view and the staff is friendly."

Prompt: how many total texts are in the JSON file?
I don't know the answer.

I can chat with my data, but every answer except the first is wrong.

Is there a problem with JSONLoader or jq_schema? How can I adapt the code to get the expected output?


Solution

  • ConversationalRetrievalChain retrieves only 4 documents by default; see top_k_docs_for_context: int = 4 in ../langchain/chains/conversational_retrieval/base.py. The model therefore sees only 4 of the 10 reviews for each question.

    That makes sense, since you don't want to send every document to the LLM (there is an associated cost, too). Depending on the use case, you can raise the default to a more suitable value as follows:

    chain = ConversationalRetrievalChain.from_llm(
      llm=ChatOpenAI(model="gpt-3.5-turbo"),
      retriever=index.vectorstore.as_retriever(search_kwargs={"k": 10})
    )
    

    With this change, you get the following result:

    {'question': 'how many feedbacks present in the data ?',
     'chat_history': [],
     'answer': 'There are 10 pieces of feedback present in the data.'}
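The effect of k can be illustrated without LangChain or an API key. The sketch below is a toy stand-in, not the library's retriever: it ranks the ten reviews by simple word overlap with the question (a real vector store uses embedding similarity) and keeps the top k. The names tokenize and retrieve are hypothetical.

```python
import re

REVIEWS = [
    "Great hotel, excellent service and comfortable rooms.",
    "I had a terrible experience at this hotel. The room was dirty and the staff was rude.",
    "Highly recommended! The hotel has a beautiful view and the staff is friendly.",
    "Average hotel. The room was okay, but nothing special.",
    "I absolutely loved my stay at this hotel. The amenities were top-notch.",
    "Disappointing experience. The hotel was overpriced for the quality provided.",
    "The hotel exceeded my expectations. The room was spacious and clean.",
    "Avoid this hotel at all costs! The customer service was horrendous.",
    "Fantastic hotel with a great location. I would definitely stay here again.",
    "Not a bad hotel, but there are better options available in the area.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, docs, k=4):
    """Toy retriever (assumption: word overlap stands in for embedding
    similarity). Returns the k highest-scoring documents."""
    query_words = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(query_words & tokenize(d)), reverse=True)
    return ranked[:k]


question = "how many reviews mention the hotel?"
context_default = retrieve(question, REVIEWS)      # k=4, like the default
context_all = retrieve(question, REVIEWS, k=10)    # k=10, like the fix above

print(len(context_default), len(context_all))
```

With k=4 the model is only ever handed four reviews, which is consistent with it reporting "four feedbacks". Raising k to the collection size works for a ten-item file, but on larger data it increases prompt size and cost, so a value that suits the use case is preferable.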