Tags: python, azure, azure-openai, llama, llama-index

Unknown Document Type Error while using LlamaIndex with Azure OpenAI


I'm trying to reproduce the code from the documentation: https://docs.llamaindex.ai/en/stable/examples/customization/llms/AzureOpenAI.html and get the following error after index = VectorStoreIndex.from_documents(documents):

raise ValueError(f"Unknown document type: {type(document)}")
ValueError: Unknown document type: <class 'llama_index.legacy.schema.Document'>

Because all of these generative AI libraries are constantly being updated, I had to switch the import of SimpleDirectoryReader to from llama_index.legacy.readers.file.base import SimpleDirectoryReader. Everything else is the same as in the tutorial (llama_index==0.10.18, Python 3.9.16); the failing combination is sketched below. I have already spent several hours on this and have run out of ideas, so any assistance would be super helpful :)
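
A minimal sketch of the setup that triggers the error (the file path is a placeholder):

    # Reader imported from the legacy modules (the workaround import)
    from llama_index.legacy.readers.file.base import SimpleDirectoryReader
    # Index imported from the core modules, as in the tutorial
    from llama_index.core import VectorStoreIndex

    documents = SimpleDirectoryReader(input_files=["./data/s1.txt"]).load_data()
    index = VectorStoreIndex.from_documents(documents)
    # ValueError: Unknown document type: <class 'llama_index.legacy.schema.Document'>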

Many thanks in advance.


Solution

  • The error occurs because of the type of document you are passing to VectorStoreIndex.from_documents().

    When you import SimpleDirectoryReader from the legacy modules, the documents it loads are of type llama_index.legacy.schema.Document.

    (Screenshot: type(documents[0]) shows llama_index.legacy.schema.Document.)

    You are passing that to VectorStoreIndex, which is imported from core modules: from llama_index.core import VectorStoreIndex.

    The documentation you referred to is written for the core modules: import both as from llama_index.core import VectorStoreIndex, SimpleDirectoryReader and everything will work fine, as sketched below.
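
    For reference, a minimal sketch of the core-module version (the API key, endpoint, and deployment names are placeholders; the Azure integrations are installed with pip install llama-index-llms-azure-openai and pip install llama-index-embeddings-azure-openai):

    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.azure_openai import AzureOpenAI
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

    # Placeholder credentials -- replace with your own deployment details
    api_key = "<api_key>"
    azure_endpoint = "https://<resource_name>.openai.azure.com/"
    api_version = "2023-07-01-preview"

    # Core modules are configured through the global Settings object
    Settings.llm = AzureOpenAI(
        model="gpt-4",
        deployment_name="gpt4",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )
    Settings.embed_model = AzureOpenAIEmbedding(
        model="text-embedding-ada-002",
        deployment_name="embeding1",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )

    # Reader and index both come from core, so the document types match
    documents = SimpleDirectoryReader(input_files=["./data/s1.txt"]).load_data()
    index = VectorStoreIndex.from_documents(documents)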

    If you wish to use legacy modules, then use the code below.

    from llama_index.legacy.llms.azure_openai import AzureOpenAI
    from llama_index.legacy.embeddings.azure_openai import AzureOpenAIEmbedding
    from llama_index.legacy import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
    import logging
    import sys
    
    logging.basicConfig(
        stream=sys.stdout, level=logging.INFO
    )  # logging.DEBUG for more verbose output
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
    
    api_key = "3c9xxxyyyyzzzzzssssssdb9"
    azure_endpoint = "https://<resource_name>.openai.azure.com/"
    api_version = "2023-07-01-preview"
    
    llm = AzureOpenAI(
        model="gpt-4",
        deployment_name="gpt4",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )
    
    # You need to deploy your own embedding model as well as your own chat completion model
    embed_model = AzureOpenAIEmbedding(
        model="text-embedding-ada-002",
        deployment_name="embeding1",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )
    
    documents = SimpleDirectoryReader(input_files=["./data/s1.txt"]).load_data()
    print(type(documents[0]))  # <class 'llama_index.legacy.schema.Document'>
    
    # Legacy modules configure models through a ServiceContext
    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model
    )
    
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    

    Query the index and print the result:

    query = "What is the model name and who updated it last?"
    query_engine = index.as_query_engine()
    answer = query_engine.query(query)
    print("query was:", query)
    print("answer was:", answer)
    

    (Screenshot: the query and the generated answer printed to stdout.)

    In short, when using legacy modules, import all tools and models from the same legacy package and pass a ServiceContext to the vector store index; the core modules configure models through the global Settings object instead.