python python-3.x openai-api chatgpt-api llama-index

I'm trying to run the llama_index model, but it fails every time at the index-building step. How can I fix this?


I'm trying to use llama_index, which builds an index from your personal documents and lets you ask questions about their contents through ChatGPT.

This is the full code (with my actual API key, of course):

import os
os.environ["OPENAI_API_KEY"] = 'YOUR_OPENAI_API_KEY'

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader
# Load every file from the 'data' folder, then build a vector index over it
documents = SimpleDirectoryReader('data').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)

When I run the index build following the steps in their documentation, it fails at this line:

index = GPTSimpleVectorIndex.from_documents(documents)

with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\indices\base.py", line 92, in from_documents
    service_context = service_context or ServiceContext.from_defaults()
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\indices\service_context.py", line 71, in from_defaults
    embed_model = embed_model or OpenAIEmbedding()
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\embeddings\openai.py", line 209, in __init__
    super().__init__(**kwargs)
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\embeddings\base.py", line 55, in __init__
    self._tokenizer: Callable = globals_helper.tokenizer
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\llama_index\utils.py", line 50, in tokenizer
    enc = tiktoken.get_encoding("gpt2")
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\tiktoken\registry.py", line 63, in get_encoding
    enc = Encoding(**constructor())
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\tiktoken_ext\openai_public.py", line 11, in gpt2
    mergeable_ranks = data_gym_to_mergeable_bpe_ranks(
  File "C:\Users\COLMI\AppData\Local\Programs\Python\Python310\lib\site-packages\tiktoken\load.py", line 83, in data_gym_to_mergeable_bpe_ranks
    for first, second in bpe_merges:
ValueError: not enough values to unpack (expected 2, got 1)

I should mention that I'm running this on DOCX files in a specific folder, which also contains subfolders with more such files.


Solution

  • It turned out I had misunderstood how the code is meant to be used.

    The value 'data' has no special meaning: it is simply an example name for the folder that holds your own files.

    A local path can be used like:

    documents = SimpleDirectoryReader('my_folder').load_data()
    

    or an absolute path. On Windows, use a raw string so the backslashes are not treated as escape sequences:

    documents = SimpleDirectoryReader(r'c:\users\user\my_files').load_data()
    

    Once the reader points at a real folder containing your files, everything works as expected.