Tags: python, for-loop, nlp, nltk, punctuation

Remove punctuation marks from tokenized text using for loop


I'm trying to remove punctuation from a tokenized text in Python like so:

word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)

This works somewhat: it removes a lot of the punctuation marks, but for some reason many of the punctuation marks in word_tokens are still left. If I run the code again, it removes some more. After running the same code 3 times, all the marks are removed. Why does this happen?

It doesn't seem to matter whether punctuation_marks is a list, a string or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time. Is there a simple way to fix this so that running the code once is sufficient?


Solution

  • You are removing elements from the same list that you are iterating over. It seems that you are aware of the potential problem, which is why you added the line:

    w = word_tokens

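    However, that line doesn't actually create a copy of the object referenced by word_tokens; it only makes w reference the same object. Each w.remove(e) therefore shifts the remaining elements of the list being iterated one position to the left, so the loop skips the element right after every removed one. That is why some marks survive each run. To create a copy, you can use slicing, replacing the above line with: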

    w = word_tokens[:]
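
To see the difference in isolation, here is a minimal sketch. The sample tokens are made up and stand in for whatever the tokenizer returns, and punctuation_marks is assumed to be built from string.punctuation (the question doesn't say how it was defined). The names aliased, leftover, cleaned and filtered are only for illustration. The last line shows a common alternative, a list comprehension, which is not part of the fix above but avoids remove() entirely.

```python
import string

# Hypothetical sample tokens standing in for the tokenizer's output.
word_tokens = ["Hello", ",", ",", "world", "!", "!", "?"]
# Assumption: punctuation_marks is derived from string.punctuation.
punctuation_marks = set(string.punctuation)

# Aliasing, as in the question: both names refer to one list, so each
# remove() shifts the remaining items left and the loop skips the next one.
aliased = list(word_tokens)
leftover = aliased
for e in aliased:
    if e in punctuation_marks:
        leftover.remove(e)

# Slice copy: iterate over the original, remove from the independent copy.
cleaned = word_tokens[:]
for e in word_tokens:
    if e in punctuation_marks:
        cleaned.remove(e)

# Alternative: build a new list instead of removing items one by one.
filtered = [t for t in word_tokens if t not in punctuation_marks]

print(leftover)
print(cleaned)
print(filtered)
```

With these sample tokens, leftover still contains a comma and an exclamation mark after one pass, while cleaned and filtered contain only the two words. The list comprehension is also likely to be faster on long texts, since each remove() call has to search the list from the start.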