Tags: python, azure, azure-openai, llama, llama-index

Unknown Document Type Error while using LlamaIndex with Azure OpenAI


I'm trying to reproduce the code from the documentation: https://docs.llamaindex.ai/en/stable/examples/customization/llms/AzureOpenAI.html and get the following error after index = VectorStoreIndex.from_documents(documents):

raise ValueError(f"Unknown document type: {type(document)}")
ValueError: Unknown document type: <class 'llama_index.legacy.schema.Document'>

Because all of these generative AI libraries are constantly being updated, I had to switch the import of SimpleDirectoryReader to from llama_index.legacy.readers.file.base import SimpleDirectoryReader. Everything else is the same as in the tutorial (llama_index==0.10.18, Python 3.9.16); the failing combination is sketched below. I have already spent several hours on this and have run out of ideas, so any assistance would be super helpful :)
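
A minimal sketch of the setup that triggers the error (the file path is a placeholder):

    # Reader imported from the legacy modules (the workaround import)
    from llama_index.legacy.readers.file.base import SimpleDirectoryReader
    # Index imported from the core modules, as in the tutorial
    from llama_index.core import VectorStoreIndex

    documents = SimpleDirectoryReader(input_files=["./data/s1.txt"]).load_data()
    index = VectorStoreIndex.from_documents(documents)
    # ValueError: Unknown document type: <class 'llama_index.legacy.schema.Document'>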

Many thanks in advance.


Solution

  • The error occurs because of the type of document you are passing to VectorStoreIndex.from_documents().

    When you import SimpleDirectoryReader from the legacy modules, the documents it loads are of type llama_index.legacy.schema.Document.

    (Screenshot: type(documents[0]) shows llama_index.legacy.schema.Document.)

    You are passing that to VectorStoreIndex, which is imported from core modules: from llama_index.core import VectorStoreIndex.

    The documentation you referred to is written for the core modules: import both as from llama_index.core import VectorStoreIndex, SimpleDirectoryReader and everything will work fine, as sketched below.
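
    For reference, a minimal sketch of the core-module version (the API key, endpoint, and deployment names are placeholders; the Azure integrations are installed with pip install llama-index-llms-azure-openai and pip install llama-index-embeddings-azure-openai):

    from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
    from llama_index.llms.azure_openai import AzureOpenAI
    from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding

    # Placeholder credentials -- replace with your own deployment details
    api_key = "<api_key>"
    azure_endpoint = "https://<resource_name>.openai.azure.com/"
    api_version = "2023-07-01-preview"

    # Core modules are configured through the global Settings object
    Settings.llm = AzureOpenAI(
        model="gpt-4",
        deployment_name="gpt4",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )
    Settings.embed_model = AzureOpenAIEmbedding(
        model="text-embedding-ada-002",
        deployment_name="embeding1",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )

    # Reader and index both come from core, so the document types match
    documents = SimpleDirectoryReader(input_files=["./data/s1.txt"]).load_data()
    index = VectorStoreIndex.from_documents(documents)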

    If you wish to use legacy modules, then use the code below.

    from llama_index.legacy.llms.azure_openai import AzureOpenAI
    from llama_index.legacy.embeddings.azure_openai import AzureOpenAIEmbedding
    from llama_index.legacy import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
    import logging
    import sys
    
    logging.basicConfig(
        stream=sys.stdout, level=logging.INFO
    )  # logging.DEBUG for more verbose output
    logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
    
    api_key = "3c9xxxyyyyzzzzzssssssdb9"
    azure_endpoint = "https://<resource_name>.openai.azure.com/"
    api_version = "2023-07-01-preview"
    
    llm = AzureOpenAI(
        model="gpt-4",
        deployment_name="gpt4",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )
    
    # You need to deploy your own embedding model as well as your own chat completion model
    embed_model = AzureOpenAIEmbedding(
        model="text-embedding-ada-002",
        deployment_name="embeding1",
        api_key=api_key,
        azure_endpoint=azure_endpoint,
        api_version=api_version,
    )
    
    documents = SimpleDirectoryReader(input_files=["./data/s1.txt"]).load_data()
    print(type(documents[0]))  # <class 'llama_index.legacy.schema.Document'>
    
    # Legacy modules configure models through a ServiceContext
    service_context = ServiceContext.from_defaults(
        llm=llm, embed_model=embed_model
    )
    
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    

    Query the index and print the result:

    query = "What is the model name and who updated it last?"
    query_engine = index.as_query_engine()
    answer = query_engine.query(query)
    print("query was:", query)
    print("answer was:", answer)
    

    (Screenshot: the query and the generated answer printed to stdout.)

    In short, when using legacy modules, import all tools and models from the same legacy package and pass a ServiceContext to the vector store index; the core modules configure models through the global Settings object instead.