Tags: large-language-model, llama, chromadb

Why is RAG slower than the LLM?


I am using RAG with Llama 3 for an AI bot, and I find that RAG with ChromaDB is much slower than calling the LLM itself. In the test results below, with just one simple web page of about 1000 words, retrieval takes more than 2 seconds:

Time used for retrieving: 2.245511054992676
Time used for LLM: 2.1182022094726562

Here is my simple code:

import time

# LangChain community integrations for Ollama and Chroma
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# splits, combine_docs, and ollama_llm are defined elsewhere in my script
embeddings = OllamaEmbeddings(model="llama3")
vectorstore = Chroma.from_documents(documents=splits, embedding=embeddings)
retriever = vectorstore.as_retriever()
question = "What is COCONut?"

start = time.time()
retrieved_docs = retriever.invoke(question)        # embeds the query, then searches Chroma
formatted_context = combine_docs(retrieved_docs)   # concatenates the retrieved chunks
end = time.time()
print(f"Time used for retrieving: {end - start}")

start = time.time()
answer = ollama_llm(question, formatted_context)   # generates the answer from the retrieved context
end = time.time()
print(f"Time used for LLM: {end - start}")

I found that even when my ChromaDB is only about 1.4 MB, retrieval takes more than 20 seconds while the LLM call still takes only about 3 or 4 seconds. Is there anything I'm missing, or is RAG itself just this slow?


Solution

    • Retrieval-Augmented Generation (RAG) is slower than calling a Large Language Model (LLM) directly because of the extra retrieval step.
    • A RAG pipeline has to search a database for relevant information before it can generate, which can be time-consuming, especially with large databases. A plain LLM call relies only on its pre-trained knowledge and skips that retrieval step, so it responds faster.
    • Note, however, that a plain LLM may lack the most current or specific information, whereas a RAG pipeline can pull from external data sources and give more detailed answers based on up-to-date information.
    • So despite being slower, RAG wins on response quality and relevance for complex, information-rich queries; you can check where your time actually goes with the timing sketch below. Hope I am able to help.
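
    As a rough illustration, here is a small sketch (not your exact code; it just assumes the same LangChain / Ollama / Chroma setup you posted) that times the two halves of the retrieval step separately: embedding the query with the llama3 model through Ollama, and the vector similarity search inside Chroma. In a setup like yours, the query-embedding call is usually the part that dominates.

    import time

    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_core.documents import Document

    embeddings = OllamaEmbeddings(model="llama3")

    # Tiny dummy corpus just for timing; in your case this would be your `splits`
    docs = [Document(page_content=f"dummy paragraph number {i}") for i in range(20)]
    vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)

    question = "What is COCONut?"

    # 1) Embed the query: one forward pass through the llama3 model via Ollama
    start = time.time()
    query_vector = embeddings.embed_query(question)
    print(f"Time to embed the query: {time.time() - start}")

    # 2) Pure vector search in Chroma, reusing the precomputed query vector
    start = time.time()
    hits = vectorstore.similarity_search_by_vector(query_vector, k=4)
    print(f"Time for the Chroma similarity search: {time.time() - start}")

    If the first number accounts for almost all of your 2+ seconds, the cost is the query embedding inside the retrieval step rather than Chroma itself; the similarity search over a single ~1000-word page should take only a small fraction of that.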

    Please mark this as an answer if you think it fits your case.