Readlines causing error after many lines?


I'm working on an NER task at the moment, with data from wnut17train.conll (https://github.com/leondz/emerging_entities_17). It's basically a collection of tweets where each line holds a single word from the tweet and its IOB tag, separated by a \t. Different tweets are separated by a blank line (actually, and weirdly enough if you ask me, a '\t\n' line).

So, for reference, two consecutive tweets would look like this:

@paulwalk    IOBtag
...          ...
foo          IOBtag
[\t\n]
@jerrybeam   IOBtag
...          ...
bar          IOBtag

The goal for this first step is to convert this data set into training data looking like this:

train[0] = [(first_word_of_first_tweet, POStag, IOBtag),
(second_word_of_first_tweet, POStag, IOBtag),
...,
(last_word_of_first_tweet, POStag, IOBtag)]

This is what I came up with so far:

import spacy

tmp = []
train = []
nlp = spacy.load("en_core_web_sm")
with open("wnut17train.conll") as f:
    for l in f.readlines():
        if l == '\t\n':
            train.append(tmp)
            tmp = []
        else:
            doc = nlp(l.split()[0])
            for token in doc:
                tmp.append((token.text, token.pos_, token.ent_iob_))

Everything works smoothly for a certain number of tweets (or lines, not sure yet), but after that I get an

IndexError: list index out of range

raised by

doc = nlp(l.split()[0])

The first time, I got it around line 20'000 (20'533 to be precise). After checking that this was not due to the file (maybe a different way of separating tweets, or something similar that might have tricked the parser), I removed the first 20'000 lines and tried again. Again, I got the error around line 20'000 (20'260, or 40'779 in the original file, to be precise).

I did some research on readlines() to see if this was a known problem but it looks like it's not. Am I missing something?


Solution

  • I used the wnut17train.conll file from https://github.com/leondz/emerging_entities_17 and ran similar code to generate your required output. I found that in some places the blank line between tweets is only "\n" instead of "\t\n".

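    Such a line does not match the '\t\n' check, so it reaches the else branch, where l.split() returns an empty list and l.split()[0] raises IndexError: list index out of range. To handle this, we can also treat a line that contains nothing but the newline as a tweet separator and append tmp to train in that case too.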
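
To see the failure in isolation, here is a small sketch that needs neither spaCy nor the data file. The three sample lines are made up for illustration; they stand for a word line, a "\t\n" separator and a bare "\n" separator.

```python
# Hypothetical sample lines: a word line and the two separator variants.
samples = ["@paulwalk\tO\n", "\t\n", "\n"]

# split() with no argument splits on any whitespace and drops empty strings,
# so both separator variants produce an empty list.
whitespace_fields = [line.split() for line in samples]

# split("\t") keeps empty strings and leaves the trailing newline on the last field.
tab_fields = [line.split("\t") for line in samples]

for line, ws, tab in zip(samples, whitespace_fields, tab_fields):
    print(repr(line), "split():", ws, "split('\\t'):", tab)

# Indexing the empty list is what raises the IndexError from the question.
try:
    samples[2].split()[0]
except IndexError as exc:
    print("IndexError:", exc)
```

Only the "\t\n" variant is caught by the equality check in the question, so the bare "\n" line is the one that reaches the indexing step. Hence the extra check for blank lines in the code that follows.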

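    import spacy

    nlp = spacy.load("en_core_web_sm")
    train = []
    tmp = []
    with open("wnut17train.conll") as fp:
        for l in fp:
            # Separator lines are "\t\n" or a bare "\n" (length 1); both strip to "".
            if not l.strip():
                train.append(tmp)
                tmp = []
            else:
                # Split once; rstrip drops the newline that would otherwise stick to the tag.
                word, iob = l.rstrip("\n").split("\t")[:2]
                doc = nlp(word)
                # spaCy may split one word into several tokens; each yields one tuple.
                for token in doc:
                    tmp.append((word, token.pos_, iob))
    # Keep the last tweet if the file does not end with a separator line.
    if tmp:
        train.append(tmp)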
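
The separator handling can also be checked on its own, without loading spaCy. The following is a minimal sketch, not part of the code above: the function name read_conll_tweets and the sample lines are hypothetical, and it only groups (word, IOB tag) pairs, leaving POS tagging as a separate step.

```python
from typing import Iterable, List, Tuple

Tweet = List[Tuple[str, str]]


def read_conll_tweets(lines: Iterable[str]) -> List[Tweet]:
    """Group (word, tag) pairs into tweets, splitting on whitespace-only lines."""
    tweets: List[Tweet] = []
    current: Tweet = []
    for line in lines:
        if not line.strip():
            # Separator: "\t\n", "\n", or any other whitespace-only line.
            if current:
                tweets.append(current)
                current = []
            continue
        # partition() never raises, even if a line has no tab at all.
        word, _, tag = line.rstrip("\r\n").partition("\t")
        current.append((word, tag))
    if current:
        # Flush the last tweet when the input has no trailing separator.
        tweets.append(current)
    return tweets


# Made-up sample mixing both separator variants.
sample = [
    "@paulwalk\tO\n",
    "foo\tB-person\n",
    "\t\n",
    "@jerrybeam\tO\n",
    "\n",
    "bar\tO\n",
]
tweets = read_conll_tweets(sample)
print(tweets)
```

Unlike the loop above, this sketch skips empty tweets when two separators follow each other. Whether that is wanted depends on how the training data is used later. An open file object can be passed in directly, since iterating over it yields lines.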
    

    Hope your question is resolved.