Why is pre-processing causing me to lose dictionary keys?

I'm having a very peculiar problem. The extract function takes an XML file and produces a dict with restaurant reviews as keys. Here I do some basic preprocessing on the text, since I'm using it for sentiment analysis: the text is tokenized, punctuation is removed, and it is 'un-tokenized' before being reinserted into the dict.

import string
from nltk.tokenize import word_tokenize, RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+')

def preprocess(file):
    d = extract(file)
    for text in list(d.keys()):
        tokenized_text = tokenizer.tokenize(text)
        text2 = ''.join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokenized_text]).strip()
        d[text2] = d.pop(text) 
    return d

Of the 675 reviews, 2 are missing after this function has run. These are 'great service.' and 'Delicious'. I would expect these to be returned as they are, except the full stop should be taken away from the first.

For reference, the extract function:

from collections import OrderedDict, defaultdict
import xml.etree.ElementTree as ET

def extract(file):

    tree = ET.parse(file)
    root = tree.getroot()

    if file == 'EN_REST_SB1_TEST.xml':
        d = OrderedDict()
        for sentence in root.findall('.//sentence'):
            opinion = sentence.findall('.//Opinion')
            if opinion == []:
                text = sentence.find('text').text
                d[text] = 0

        return d 

If anybody is familiar with the SemEval ABSA tasks, you'll note I've done this in a somewhat roundabout way, not making use of the id tags in the XML, but I'd prefer to stick to how I've done it.


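  • You're using the reviews as keys, which means you'll lose any duplicates. Evidently these very short reviews occur more than once, at least once the punctuation has been stripped: any two reviews that preprocess to the same string end up sharing a single key, and the later assignment overwrites the earlier one.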

    I can't think of any reason to use the reviews as keys, especially if you care about holding on to duplicates. So why not just collect them into a list?

    d = []
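
To make the collision concrete, here is a minimal sketch rather than a drop-in replacement. It assumes that `re.findall(r"\w+", ...)` is a fair stand-in for `RegexpTokenizer(r'\w+')` (so it runs without NLTK). The `normalize` helper and the sample reviews are made up for illustration and are not taken from the SemEval file.

```python
import re
from collections import OrderedDict


def normalize(text):
    # Hypothetical helper: keep runs of word characters (as a stand-in for
    # RegexpTokenizer(r'\w+')) and re-join them with single spaces.
    return " ".join(re.findall(r"\w+", text))


# Made-up sample data; two pairs differ only in punctuation.
reviews = ["great service.", "great service", "Delicious", "Delicious!", "Nice view"]

# Dict keyed by review text, re-keyed the same way as preprocess():
# reviews that normalize to the same string collapse into one key.
as_dict = OrderedDict((r, 0) for r in reviews)
for text in list(as_dict.keys()):
    as_dict[normalize(text)] = as_dict.pop(text)

# List of (text, label) pairs: nothing is keyed on the text, so every
# review survives, duplicates included.
as_list = [(normalize(r), 0) for r in reviews]

print(len(reviews), len(as_dict), len(as_list))  # prints: 5 3 5
```

With the dict, two of the five sample reviews disappear because their cleaned text matches another review's; with the list, all five are kept. If you later need to look reviews up, keying on something unique (such as the sentence id) would avoid the same problem.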