NotImplementedError: The lemmatize parameter is no longer supported

I ran the code below to build a corpus for my own GPT-2-style model, but I got the following error. How can I fix this NotImplementedError in Python?

corpus = WikiCorpus(file_path, lemmatize=False, lower=False, tokenizer_func=tokenizer_func)
  File "C:\Rayi\python\text-generate\text-gene\lib\site-packages\gensim\corpora\", line 619, in __init__
    raise NotImplementedError(
NotImplementedError: The lemmatize parameter is no longer supported. If you need to lemmatize, use e.g. <>. Perform lemmatization as part of your tokenization function and pass it as the tokenizer_func parameter to this initializer.

My code:

import tensorflow as tf
from gensim.corpora import WikiCorpus
import os
import argparse

# lang = 'bn'

def store(corpus, lang):
    base_path = os.getcwd()
    store_path = os.path.join(base_path, '{}_corpus'.format(lang))
    # Create the output directory on first use.
    if not os.path.exists(store_path):
        os.makedirs(store_path)
    file_idx = 0
    # Write each article's tokens to its own UTF-8 text file.
    for text in corpus.get_texts():
        current_file_path = os.path.join(store_path, 'article_{}.txt'.format(file_idx))
        with open(current_file_path, 'w', encoding='utf-8') as file:
            file.write(' '.join(text))
        file_idx += 1

def tokenizer_func(text: str, token_min_len: int, token_max_len: int, lower: bool) -> list:
    return [token for token in text.split() if token_min_len <= len(token) <= token_max_len]

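def run(lang):
    # origin and fname are not shown in the post; the standard Wikimedia dump location is assumed here.
    fname = '{}wiki-latest-pages-articles.xml.bz2'.format(lang)
    origin = 'https://dumps.wikimedia.org/{}wiki/latest/{}'.format(lang, fname)
    file_path = tf.keras.utils.get_file(origin=origin, fname=fname, untar=False, extract=False)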
    corpus = WikiCorpus(file_path, lemmatize=True, lower=False, tokenizer_func=tokenizer_func)
    store(corpus, lang)

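if __name__ == '__main__':
    ARGS_PARSER = argparse.ArgumentParser()
    # The add_argument call is cut off in the post; the flag name and default are assumed.
    ARGS_PARSER.add_argument(
        '--lang',
        type=str,
        default='bn',
        help='language code to download from wikipedia corpus'
    )
    ARGS = ARGS_PARSER.parse_args()
    run(ARGS.lang)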


  • I couldn't find any real documentation for WikiCorpus, only an example page.

    What I did was simply drop the lemmatize argument from the call:

    corpus = WikiCorpus(file_path, tokenizer_func=tokenizer_func)

    If you still want to lemmatize, call a lemmatization function inside tokenizer_func, as the error message describes.

    Then wait around 8 hours for it to process :D
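
To make the last suggestion in the answer above concrete, here is a minimal sketch of a tokenizer_func that lemmatizes as it tokenizes. It keeps the four-argument signature used in the question. The names TOY_LEMMAS, toy_lemmatize and lemmatizing_tokenizer are hypothetical, and the lookup table is only a stand-in for whatever real lemmatizer suits your language. The function is defined at module level rather than as a closure on the assumption that gensim hands it to worker processes, which requires it to be picklable.

```python
from typing import List

# Hypothetical stand-in lemmatizer: a tiny lookup table, for illustration only.
# Replace it with a real lemmatizer for your language.
TOY_LEMMAS = {"running": "run", "ran": "run", "models": "model"}


def toy_lemmatize(token: str) -> str:
    """Return the lemma from the toy table, or the token unchanged if unknown."""
    return TOY_LEMMAS.get(token.lower(), token)


def lemmatizing_tokenizer(text: str, token_min_len: int, token_max_len: int, lower: bool) -> List[str]:
    """Split on whitespace, filter by token length, then lemmatize each token."""
    if lower:
        text = text.lower()
    tokens = [t for t in text.split() if token_min_len <= len(t) <= token_max_len]
    return [toy_lemmatize(t) for t in tokens]


# Passed to gensim without the lemmatize argument (not executed here):
# corpus = WikiCorpus(file_path, lower=False, tokenizer_func=lemmatizing_tokenizer)
```

You can check the function on a short string before starting the long run over the full dump, for example lemmatizing_tokenizer("The models ran quickly", 2, 15, False).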