I would like to remove dashes (but not hyphens) from sentences as part of NLP preprocessing.
Input
samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.', #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
Expected Output
#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']
The sentences above are taken from two articles about hyphens and dashes.
samples = [
    'A former employee of the accused company, — — —, offered a statement off the record.', #dash
    'He is afraid of two things—spiders and senior prom.' #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)
    text = ' '.join(ret)
    samples[i] = text
print(samples)
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
for i in range(len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)
#output
[['A former employee of the accused company',
'— — —',
'offered a statement off the record.'],
['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
'fifty-six bottles of pop.']]
Python 3.7.0
If you are looking for a non-regex solution: the Unicode code point of the em dash is 8212 (U+2014), so you can replace every em dash with ',', split on ',', and keep only the pieces that are non-empty after stripping whitespace:
>>> samples = [
...     'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
...     'He is afraid of two things — spiders and senior prom.', #dash
...     'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
... ]
>>> output = [[
...     sentence.strip() for sentence in elem.replace(chr(8212), ',').split(',')
...     if sentence.strip()
... ] for elem in samples]
>>> output
[['A former employee of the accused company',
'offered a statement off the record.'],
['He is afraid of two things', 'spiders and senior prom.'],
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']]
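If a regex turns out to be acceptable after all, a variant along the following lines might also cover the en dash and a couple of other dash characters in one pass. This is only a sketch: the helper name `split_on_dashes` and the set of characters in `DASHES` are my own assumptions rather than anything from the question, so adjust them to whatever actually appears in your data. The ASCII hyphen (code point 45) is left out on purpose so that words like "Fifty-six" stay intact.

```python
import re

# Assumption: the dash-like characters to treat as separators.
# U+2012 figure dash, U+2013 en dash, U+2014 em dash, U+2015 horizontal bar.
# The ASCII hyphen-minus (U+002D) is deliberately excluded.
DASHES = "\u2012\u2013\u2014\u2015"

# Character class matching a comma or any one of the dashes above.
SPLIT_PATTERN = re.compile("[," + DASHES + "]")


def split_on_dashes(text):
    """Split text on commas and dash characters, dropping empty pieces."""
    pieces = (piece.strip() for piece in SPLIT_PATTERN.split(text))
    return [piece for piece in pieces if piece]


samples = [
    'A former employee of the accused company, ———, offered a statement off the record.',
    'He is afraid of two things — spiders and senior prom.',
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',
]

output = [split_on_dashes(sample) for sample in samples]
for sentence_parts in output:
    print(sentence_parts)
```

For the three samples this should give the same nested lists as the `replace`/`split` version above. The only practical difference is that other dash characters, such as an en dash in a range like "10–20", would be treated as separators too, which may or may not be what you want.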