import nltk
from urllib import request
from redditscore.tokenizer import CrazyTokenizer
tokenizer = CrazyTokenizer()
url = "http://www.site.uottawa.ca/~diana/csi5386/A1_2020/microblog2011.txt"
response = request.urlopen(url)
raw = response.read().decode('utf-8-sig')
tokenizer.tokenize(raw)
I'm trying to tokenize the data at the URL above, and when I run this I get the following error:
ValueError: [E088] Text of length 5190319 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
How do I increase this limit?
CrazyTokenizer is designed for tweets and online comments, so really long texts should not occur. Your data appears to be one tweet per line, so the best approach is to feed the tokenizer one line at a time:
from urllib import request
from redditscore.tokenizer import CrazyTokenizer
tokenizer = CrazyTokenizer()
url = "http://www.site.uottawa.ca/~diana/csi5386/A1_2020/microblog2011.txt"
for line in request.urlopen(url):
    tokens = tokenizer.tokenize(line.decode('utf-8'))
    print(tokens)
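
If you really do need to tokenize the whole file in one call, the error itself points at spaCy's nlp.max_length. As a rough sketch (assuming you work with a plain spaCy pipeline directly; whether CrazyTokenizer exposes its internal spaCy object is not shown here), raising the limit would look something like this:

import spacy

# Sketch only: raise spaCy's character limit for a pipeline without parser/NER,
# since the memory warning in the error applies to those components.
nlp = spacy.blank("en")        # blank English pipeline
nlp.max_length = 6_000_000     # must be larger than len(raw), which is ~5.2M here
doc = nlp(raw)                 # 'raw' is the downloaded text from the question
tokens = [token.text for token in doc]

Splitting the input line by line, as above, is still the safer option in terms of memory.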