Apologies if I'm making an extremely trivial mistake! Essentially, I have a downloaded document that I tokenized the usual way with NLTK, i.e. tokens = word_tokenize(f.read()), so tokens is a list. I want to add start and end tokens to the beginning and end of each line. I also have a dictionary vocab which does what it sounds like (it stores a count for each word in the document).
The two things I have tried are:
for line in tokens:
    line.insert(0, '<start>')
    line.insert(len(line)-1, '<end>')
    vocab['<start>'] += 1
    vocab['<end>'] += 1
and:
for line in tokens:
    line = ['<start>'] + line + ['<end>']
    vocab['<start>'] += 1
    vocab['<end>'] += 1
If I use the .insert() method, I get AttributeError: 'str' object has no attribute 'insert'. If I try to concatenate the start/end tokens to every line in the tokens list, I get TypeError: can only concatenate list (not "str") to list.
Not really sure how to fix this so I'd appreciate any help :)
The root cause is upstream of your loops: word_tokenize(f.read()) tokenizes the whole file at once, so tokens is a flat list of word strings, not a list of lines. Each "line" in your loop is therefore a single token string, which is exactly why .insert() raises AttributeError and why list concatenation raises TypeError. Tokenize each line separately so tokens becomes a list of lists, and then your insert approach works:

tokens = [word_tokenize(line) for line in f.read().splitlines()]

for line in tokens:
    line.insert(0, '<start>')   # prepend the start token in place
    line.append('<end>')        # append; insert(len(line)-1, ...) would put it before the last word
    vocab['<start>'] += 1
    vocab['<end>'] += 1

Since each element of tokens is now a mutable list of that line's tokens, modifying it in place updates tokens directly. Note also that even with a list of lists, your second attempt would not change tokens, because line = ['<start>'] + line + ['<end>'] rebinds the loop variable to a new list rather than modifying the one stored in tokens.
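To make the fix concrete, here is a minimal self-contained sketch. It uses str.split() as a stand-in for word_tokenize and collections.Counter as a stand-in for the vocab dictionary (both stand-ins are my assumptions, not from the original post), so it runs without NLTK or a file:

```python
from collections import Counter

# Stand-in for f.read(); in practice this would be the downloaded document.
text = "the cat sat\nthe dog ran"

# Tokenize each line separately so tokens is a list of lists,
# one inner list of word tokens per line.
# str.split() stands in for nltk.word_tokenize here.
tokens = [line.split() for line in text.splitlines()]

# Counter stands in for the vocab dictionary of word counts.
vocab = Counter(word for line in tokens for word in line)

for line in tokens:
    line.insert(0, '<start>')  # prepend start token in place
    line.append('<end>')       # append end token in place
    vocab['<start>'] += 1
    vocab['<end>'] += 1

print(tokens[0])         # ['<start>', 'the', 'cat', 'sat', '<end>']
print(vocab['<start>'])  # 2 (one per line)
```

Because each inner list is mutated in place, tokens itself is updated; no reassignment is needed inside the loop.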