
Haystack - Error in executing squad_to_dpr.py


I get an error when I run the squad_to_dpr.py script on my custom SQuAD 2.0-formatted training dataset while using the DPR retriever. The same error occurs with the BM25 retriever. The script works fine with the development set. Both files are JSON files formatted according to the SQuAD 2.0 format.

Error message

Traceback (most recent call last):
  File "/home/user/Documents/haystackwork/DPR/squad_to_dpr.py", line 348, in <module>
    main(
  File "/home/user/Documents/haystackwork/DPR/squad_to_dpr.py", line 271, in main
    document_store.add_eval_data(squad_file_path.as_posix(), doc_index="document", preprocessor=preprocessor)
  File "/home/user/opt/anaconda3/envs/haystack_env/lib/python3.9/site-packages/haystack/document_stores/base.py", line 446, in add_eval_data
    docs, labels = eval_data_from_json(
  File "/home/user/opt/anaconda3/envs/haystack_env/lib/python3.9/site-packages/haystack/document_stores/utils.py", line 46, in eval_data_from_json
    cur_docs, cur_labels, cur_problematic_ids = _extract_docs_and_labels_from_dict(
  File "/home/user/opt/anaconda3/envs/haystack_env/lib/python3.9/site-packages/haystack/document_stores/utils.py", line 225, in _extract_docs_and_labels_from_dict
    context=cur_doc.content,
UnboundLocalError: local variable 'cur_doc' referenced before assignment

I cannot work out the exact reason for this error.

Expected behavior: The SQuAD 2.0-format data file should be converted into the DPR format.

Additional context: I accepted all defaults in the script and have not changed anything inside it.

To reproduce: I invoke the script as:

python squad_to_dpr.py --squad_input_filename dataset_squad/urqa_train_nqa_v1.json --dpr_output_filename corpus_dpr/urqa_train_dpr.json --num_hard_negative_ctxs 2

System:

OS: Ubuntu 22.04 LTS
GPU/CPU: i7 10750H + nVidia RTX 2060
Haystack version (commit or version number): farm_haystack-1.21.2
DocumentStore: ElasticsearchDocumentStore
Reader: None
Retriever: DensePassageRetriever

Solution

  • You could try different settings for the PreProcessor in the script. By default, it uses the following:

    preprocessor = PreProcessor(
            split_length=100,
            split_overlap=0,
            clean_empty_lines=False,
            split_respect_sentence_boundary=False,
            clean_whitespace=False,
        )
    

    I suggest setting clean_whitespace=True and increasing split_length, for example to 150. You could also try enabling split_respect_sentence_boundary. The problem seems to be an edge case triggered by one of your document texts (called contexts in the SQuAD format) being longer than 100 words. The preprocessor then splits that text into several shorter texts, and there seems to be a bug in how the answer is matched to one of them. Usually the following if condition is true for one of the splits s, but for some reason none of the splits in your example satisfies it, so cur_doc is never assigned (line 212 in the implementation of _extract_docs_and_labels_from_dict):

    if (answer["answer_start"] >= s.meta["_split_offset"]) and (
        answer["answer_start"] < (s.meta["_split_offset"] + len(s.content))
    ):
    

    It would be very helpful if you could share the text that causes this error and open an issue on GitHub: https://github.com/deepset-ai/haystack/issues/new/choose
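
To locate the text that triggers the error before opening an issue, a small check along these lines might help. It is only a sketch and not part of Haystack: the function name find_suspicious_answers is hypothetical, and it assumes the standard SQuAD layout (data, paragraphs, qas, answers). It flags answers whose answer_start lies outside the context or does not point at the answer text, and it reports the word count of the affected context so you can see whether it exceeds split_length. It does not reproduce the preprocessor's splitting, so a clean result does not rule out the bug.

```python
import json
from pathlib import Path


def find_suspicious_answers(squad_path):
    """Yield (question_id, reason, context_word_count) for doubtful answers.

    Hypothetical helper, not a Haystack API. It only inspects the raw
    SQuAD-style JSON file and never calls the preprocessor.
    """
    data = json.loads(Path(squad_path).read_text(encoding="utf-8"))
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            # Whitespace-separated word count, comparable to split_length.
            n_words = len(context.split())
            for qa in paragraph["qas"]:
                # Unanswerable SQuAD 2.0 questions have an empty answer list.
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    text = answer["text"]
                    if not 0 <= start < len(context):
                        yield qa["id"], "answer_start outside context", n_words
                    elif context[start:start + len(text)] != text:
                        yield qa["id"], "answer text not at answer_start", n_words
```

Iterating over find_suspicious_answers("dataset_squad/urqa_train_nqa_v1.json") and printing each tuple gives a list of question ids to inspect. Entries whose word count is above 100 are the ones that get split with the default settings.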