python · nltk · tokenize

Adding <start> and <end> tokens to lines of a tokenized document


Apologies if I'm making an extremely trivial mistake! Essentially, I have a downloaded document that I tokenized the usual way with NLTK, i.e. tokens = word_tokenize(f.read()), so tokens is a list. I want to add <start> and <end> tokens at the beginning and end of each line. I also have a dictionary vocab which does what it sounds like (it stores the count of each word in the document).

The two things I have tried are:

for line in tokens:
    line.insert(0,'<start>')
    line.insert(len(line)-1,'<end>')
    vocab['<start>']+=1
    vocab['<end>']+=1

and:

for line in tokens:
    line=['<start>']+line+['<end>']
    vocab['<start>']+=1
    vocab['<end>']+=1

If I use the .insert() method, I get AttributeError: 'str' object has no attribute 'insert'. If I try to concatenate the start/end tokens to every line in the tokens list, I get TypeError: can only concatenate list (not "str") to list. I'm not really sure how to fix this, so I'd appreciate any help :)
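
For reference, here is a tiny reproduction with made-up text (not my actual document), just to show what tokens looks like:

from nltk.tokenize import word_tokenize

text = "the cat sat\non the mat"   # stand-in for f.read()
tokens = word_tokenize(text)
print(tokens)           # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(type(tokens[0]))  # <class 'str'>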


Solution

  • Try this.

    for i in range(len(tokens)):
        tokens[i] = '<start>' + tokens[i] + '<end>'
        vocab['<start>']+=1
        vocab['<end>']+=1
    

    Basically, this loops over the list by index and adds the start and end markers to each element. The elements of tokens are strings, which is why .insert() raised an AttributeError and list concatenation raised a TypeError, so plain string concatenation works here. And because range(len(tokens)) gives you the index, assigning to tokens[i] actually replaces the element inside tokens, whereas rebinding line inside for line in tokens leaves the list unchanged.
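
    If what you actually want is one <start>/<end> pair per line of the file, a different sketch (my own variation, not the code above; it assumes you can re-open the file, the filename is just a placeholder, and it swaps your vocab dict for a collections.Counter) is to tokenize line by line instead of tokenizing the whole file at once:

    from collections import Counter
    from nltk.tokenize import word_tokenize

    vocab = Counter()
    lines = []                              # one token list per line of the file
    with open('document.txt') as f:         # placeholder filename
        for raw_line in f:
            line = ['<start>'] + word_tokenize(raw_line) + ['<end>']
            lines.append(line)
            vocab.update(line)              # counts every token, including <start> and <end>

    With that structure each line really is a list, so .insert() and list concatenation both behave the way you originally expected.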