
Text length exceeds maximum - How to increase it?


  from urllib import request
  from redditscore.tokenizer import CrazyTokenizer

  tokenizer = CrazyTokenizer()
  url = "http://www.site.uottawa.ca/~diana/csi5386/A1_2020/microblog2011.txt"
  response = request.urlopen(url)
  raw = response.read().decode('utf-8-sig')
  tokenizer.tokenize(raw)  # fails here: the whole file is passed as a single string

I'm trying to tokenize the data at that URL, and when I run the code I get the following error:

    ValueError: [E088] Text of length 5190319 exceeds maximum of 1000000. The v2.x parser and
    NER models require roughly 1GB of temporary memory per 100,000 characters in the input.
    This means long texts may cause memory allocation errors. If you're not using the parser
    or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of
    characters, so you can check whether your inputs are too long by checking len(text).

How can I increase the length limit?


Solution

  • CrazyTokenizer is made specifically for tweets and online comments, so really long texts should not occur. Your data is most likely one tweet per line, so the best approach is to feed one line at a time to the tokenizer:

    from urllib import request
    from redditscore.tokenizer import CrazyTokenizer

    tokenizer = CrazyTokenizer()
    url = "http://www.site.uottawa.ca/~diana/csi5386/A1_2020/microblog2011.txt"
    for line in request.urlopen(url):                        # stream the file line by line
        tokens = tokenizer.tokenize(line.decode('utf-8'))    # each line stays far below the limit
        print(tokens)