Tags: python, python-3.x, nlp, spacy

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000 (spaCy)


I'm trying to build a corpus of words from a text file using spaCy. Here is my code:

import spacy
nlp = spacy.load('fr_core_news_md')
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But it raises this exception:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
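As the message suggests, the input size can be checked against the limit before parsing; a quick check like this (the printed values are the ones from the error message and spaCy's default limit):

import spacy

nlp = spacy.load('fr_core_news_md')
with open("text.txt") as f:
    text = ''.join(ch for ch in f.read() if ch.isalnum() or ch == " ")
print(len(text))       # 1027203, the length from the error message
print(nlp.max_length)  # 1000000, spaCy's default character limit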

So I tried something like this:

import spacy
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 1027203
f = open("text.txt")
doc = nlp(''.join(ch for ch in f.read() if ch.isalnum() or ch == " "))
f.close()
del f
words = []
for token in doc:
    if token.lemma_ not in words:
        words.append(token.lemma_)

f = open("corpus.txt", 'w')
f.write("Number of words:" + str(len(words)) + "\n" + ''.join([i + "\n" for i in sorted(words)]))
f.close()

But I got the same error:

ValueError: [E088] Text of length 1027203 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

How can I fix it?


Solution

  • I differ from the other answer here: I think `nlp.max_length` did take effect, but the value you set is too low. It looks like you set it to exactly the length reported in the error message. Increase `nlp.max_length` to a little over that number:

    nlp.max_length = 1030000 # or even higher
    

    It should ideally work after this.

    So your code could be changed to this:

    import spacy

    nlp = spacy.load('fr_core_news_md')

    # Read and filter the text first, so the limit can be set against it
    with open("text.txt") as f:
        text = ''.join(ch for ch in f.read() if ch.isalnum() or ch == " ")

    nlp.max_length = 1030000  # or higher; it only has to exceed len(text)
    doc = nlp(text)

    # A set avoids the slow list-membership test while deduplicating lemmas
    words = set(token.lemma_ for token in doc)

    with open("corpus.txt", 'w') as f:
        f.write("Number of words:" + str(len(words)) + "\n" + ''.join(w + "\n" for w in sorted(words)))
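
    Since the script only needs lemmas, the error message itself points to a safer variant: the memory warning applies to the parser and NER, so those components can be disabled and the limit raised without the memory risk. A minimal sketch along those lines (the component names `parser` and `ner` are the ones shipped with `fr_core_news_md`; the limit of 2000000 is an arbitrary headroom choice):

    import spacy

    # Lemmatization does not need the parser or NER, which are the
    # memory-hungry components the warning is about
    nlp = spacy.load('fr_core_news_md', disable=['parser', 'ner'])
    nlp.max_length = 2000000  # safe to raise once parser/NER are off

    with open("text.txt") as f:
        text = ''.join(ch for ch in f.read() if ch.isalnum() or ch == " ")

    doc = nlp(text)
    words = sorted(set(token.lemma_ for token in doc))

    with open("corpus.txt", 'w') as f:
        f.write("Number of words:" + str(len(words)) + "\n" + ''.join(w + "\n" for w in words))

    If you would rather not raise the limit at all, another option is to split the text into pieces below the limit and stream them through `nlp.pipe`, merging the lemmas per piece. A rough sketch reusing `nlp` and `text` from above (splitting on whitespace so no word is cut in half; 50000 words per chunk is an arbitrary size that keeps each piece well under the default limit):

    def chunks(tokens, size=50000):
        # Rejoin whitespace-separated words into strings below nlp.max_length
        for i in range(0, len(tokens), size):
            yield " ".join(tokens[i:i + size])

    words = set()
    for doc in nlp.pipe(chunks(text.split())):
        words.update(token.lemma_ for token in doc)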