I'm trying to generate questions and answers based on an uploaded text.
I'm using the open-source library Haystack by deepset, and it works great with English texts.
However, with Cyrillic texts such as Russian, I get chopped words in the generated questions.
I train the model on the Russian SberQUAD dataset and then try to generate Q&As from the poem
Ruslan and Ludmila by Alexander Pushkin.
The answers mostly look OK, but the questions are really a mix of syllables.
Here is my code:
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, PreProcessor, QuestionGenerator, TextConverter
from haystack.pipelines import QuestionAnswerGenerationPipeline
from haystack.utils import print_questions
from tqdm import tqdm

document_store = InMemoryDocumentStore()

converter = TextConverter(remove_numeric_tables=False, valid_languages=["ru"])
doc = converter.convert(file_path='pushkins.txt', meta=None)[0]

preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word",
    split_length=100,
    split_respect_sentence_boundary=True,
)
docs_default = preprocessor.process([doc])
document_store.write_documents(docs_default)

question_generator = QuestionGenerator()
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
# data_dir points to the SberQUAD train/dev files in SQuAD format
reader.train(
    data_dir=data_dir,
    train_filename="train-v1.1.json",
    dev_filename="dev-v1.1.json",
    use_gpu=True,
    batch_size=16,
    n_epochs=1,
    save_dir=data_dir,
)

qag_pipeline = QuestionAnswerGenerationPipeline(question_generator, reader)

output_data = []
for idx, document in enumerate(tqdm(document_store)):
    print(f"\n * Generating questions and answers for document {idx}: {document.content[:100]}...\n")
    result = qag_pipeline.run(documents=[document])
    output_data.append(result)
    print_questions(result)
    print("---")
The output is just a mix of Russian syllables and English words:
Generated pairs:
- Q: What is какат а доер мое?
A: свои
A: дети
A: согласен Скакать за дочерь
- Q: What удет не нарасен?
A: подвиг
A: моих
A: княжной
- Q: What was орестн ени?
A: княжной
A: княжной
A: полцарством
- Q: What did оскликнули свои седлаем?
A: Сейчас коней
A: жены
A: «Я!» — молвил горестный жених. «Я! я!» — воскликнули с Рогдаем Фарла
- Q: рад вес иедит мир.
A: «Сейчас коней своих седлаем; Мы рады
A: ъ
A: лцарством прадедов моих
QuestionGenerator uses valhalla/t5-base-e2e-qg as its default model, which is an English-only model, so it produces garbled questions when fed Russian text. Since you're using FARMReader with cointegrated/rubert-tiny, you must use a compatible model for QuestionGenerator. Compatibility in this case is purely a matter of the model's language, so switch to a multilingual question-generation model:
question_generator = QuestionGenerator(model_name_or_path='nbroad/mt5-base-qgen')
reader = FARMReader(model_name_or_path='cointegrated/rubert-tiny', use_gpu=True)
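To verify the new model actually emits Russian questions, a small script-mix check can flag garbled output like the examples above. This helper is not part of Haystack; it's a hypothetical snippet that simply tests whether a question contains both Latin and Cyrillic letters, which is the symptom you were seeing:

```python
import re

# Regexes for the two scripts; \u0400-\u04FF is the basic Cyrillic block.
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")

def is_mixed_script(question: str) -> bool:
    """Return True if the question mixes Latin and Cyrillic letters,
    which usually means the question-generation model is English-only."""
    return bool(CYRILLIC.search(question)) and bool(LATIN.search(question))

questions = [
    "What is какат а доер мое?",  # garbled output from the English default model
    "Кто похитил Людмилу?",       # clean Russian question
]
print([is_mixed_script(q) for q in questions])  # [True, False]
```

Running this over the generated questions after switching models should show no mixed-script entries.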