Having a very peculiar problem. The extract function takes an XML file and produces a dict using restaurant reviews as keys. I'm doing some basic preprocessing to the text because I'm using it for sentiment analysis: the text is tokenized, punctuation is removed, and it is 'un-tokenized' before being reinserted into the dict.
import string
from nltk.tokenize import word_tokenize, RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')

def preprocess(file):
    d = extract(file)
    for text in list(d.keys()):
        tokenized_text = tokenizer.tokenize(text)
        text2 = ''.join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokenized_text]).strip()
        d[text2] = d.pop(text)
    return d
Of the 675 reviews, 2 are missing after this function has run. These are 'great service.' and 'Delicious'. I would expect these to be returned as they are, except the full stop should be taken away from the first.
For reference, the extract function:
from collections import OrderedDict, defaultdict
import xml.etree.ElementTree as ET

def extract(file):
    tree = ET.parse(file)
    root = tree.getroot()
    if file == 'EN_REST_SB1_TEST.xml':
        d = OrderedDict()
        for sentence in root.findall('.//sentence'):
            opinion = sentence.findall('.//Opinion')
            if opinion == []:
                text = sentence.find('text').text
                d[text] = 0
        return d
If anybody is familiar with the SemEval ABSA tasks, you'll note I've done this in a somewhat roundabout way, not making use of the id tags in the XML, but I'd prefer to stick to how I've done it.
You're using the reviews as keys, which means you'll lose any duplicates. Evidently these very short reviews occurred twice.
I can't think of any reason to use the reviews as keys, especially if you care about holding on to duplicates. So why not just collect them into a list?
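You can reproduce the loss with a tiny inline XML snippet (a stand-in for the SemEval file, not the real data): when two sentences have identical text, the second assignment to `d[text]` silently overwrites the first.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# two sentences with identical review text
xml = ('<Reviews>'
       '<sentence><text>Delicious</text></sentence>'
       '<sentence><text>Delicious</text></sentence>'
       '</Reviews>')

root = ET.fromstring(xml)
d = OrderedDict()
for sentence in root.findall('.//sentence'):
    d[sentence.find('text').text] = 0  # duplicate key overwrites

print(len(d))  # 1 -- one of the two reviews is gone
```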
d = []
...
d.append(text)
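A sketch of the list-based preprocessing, with `re.findall(r'\w+')` as a stand-in for `RegexpTokenizer(r'\w+').tokenize` (they produce the same tokens for this pattern, and it runs without NLTK installed); `preprocess_reviews` is a hypothetical name:

```python
import re

def preprocess_reviews(reviews):
    """Strip punctuation from each review; duplicates survive in a list."""
    cleaned = []
    for text in reviews:
        tokens = re.findall(r'\w+', text)  # same tokens as RegexpTokenizer(r'\w+')
        cleaned.append(' '.join(tokens))
    return cleaned

print(preprocess_reviews(['great service.', 'Delicious', 'Delicious']))
# ['great service', 'Delicious', 'Delicious']
```

Note that both copies of 'Delicious' are kept, which is what you want if duplicate reviews carry weight in your sentiment counts.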