
Preprocessing to remove dashes, but not hyphens, from sentences


What I would like to do

I would like to remove dashes, but keep hyphens, in sentences as part of NLP preprocessing.

Input

samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.', #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

Expected Output

#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']

The sentences above are taken from two articles about hyphens and dashes.

Problem

  1. The first step, which was meant to remove the symbol '-', failed, and I do not understand why the second and third sentences were merged into a single string (with no quotation marks between them).
#output
['A former employee of the accused company, — — —, offered a statement off the record.', 
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
  2. I have no idea how to write code that distinguishes a hyphen from a dash.

Current Code

samples = [
    'A former employee of the accused company, — — —, offered a statement off the record.', #dash
    'He is afraid of two things—spiders and senior prom.' #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0 
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)

    text = ' '.join(ret)
    samples[i] = text
print(samples)

#output
['A former employee of the accused company, — — —, offered a statement off the record.', 
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']

for i in range (len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)

#output
[['A former employee of the accused company',
  '— — —',
  'offered a statement off the record.'],
 ['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
  'fifty-six bottles of pop.']]

Development Environment

Python 3.7.0


Solution

  • If you are looking for a non-regex solution: the Unicode code point of the em dash is 8212 (U+2014), so you can replace each em dash with ',', split on ',', and keep only the pieces that are non-empty after stripping whitespace:

    >>> samples = [
        'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
        'He is afraid of two things — spiders and senior prom.', #dash
        'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
    ]
    >>> output = [[
                   sentence.strip() for sentence in elem.replace(chr(8212), ',').split(',') 
                   if sentence.strip()
                  ] for elem in samples]
    >>> output
    [['A former employee of the accused company',
      'offered a statement off the record.'],
     ['He is afraid of two things', 'spiders and senior prom.'],
     ['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']]
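
  • A regex-based variant is also possible. The following is only a minimal sketch, not part of the original answer: the names `DASH_OR_COMMA` and `split_on_dashes` are made up for illustration, and splitting on the en dash (U+2013) as well is an assumption, since the question only shows em dashes (U+2014). The hyphen-minus (U+002D) is a different character and is left out of the pattern, so hyphenated words stay intact:

```python
import re

# Split on commas, en dashes (U+2013) and em dashes (U+2014).
# The hyphen-minus (U+002D) is deliberately NOT in the character class,
# so words such as "Fifty-six" are not broken apart.
# Including the en dash is an assumption; drop "\u2013" if it is unwanted.
DASH_OR_COMMA = re.compile("[,\u2013\u2014]")


def split_on_dashes(sentence):
    """Split a sentence on commas and dashes, dropping blank pieces."""
    pieces = (piece.strip() for piece in DASH_OR_COMMA.split(sentence))
    return [piece for piece in pieces if piece]


samples = [
    'A former employee of the accused company, ———, offered a statement off the record.',  # three dashes
    'He is afraid of two things — spiders and senior prom.',  # dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',  # hyphen
]

output = [split_on_dashes(sentence) for sentence in samples]
print(output)
```

    Two details in the code under Current Code probably explain the observed output. First, `ignore_symbol = ['-']` contains the hyphen-minus (U+002D), which never compares equal to the em dash (U+2014), and a word-by-word comparison cannot match a dash embedded in `'things—spiders'` anyway. Second, the `samples` list there has no comma after the second string, and Python concatenates adjacent string literals, so the second and third sentences become one list element.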