Tags: python, regex, python-3.x, tokenize

Splitting a string using a regular expression


I've been tasked with tokenizing words from a corpus using regular expressions, but I'm having trouble tokenizing abbreviations such as "e.g." or "i.e.". In particular, the one that occurs in the corpus I'm looking at appears as '(N.B.--I'.

import re

string = '(N.B.--I'
pattern = r'(\w\.){2,}'
split_p = r'((\w\.){2,})'

match = re.search(pattern, string)
print(match)

split = re.split(split_p, string)
print(split)

['(', 'N.B.', '--', 'I'] is the desired output for split, but when I run the code I get:

<_sre.SRE_Match object; span=(1, 5), match='N.B.'>
['(', 'N.B.', 'B.', '--I']

I believe I can split on the dashes by adding |-+.

However, I can't understand why this 'B.' is repeated.


Solution

  • re.split() includes the text of every capturing group in the result. Your pattern contains a nested group, (\w\.), and a repeated group only retains its last repetition, which is why the extra 'B.' shows up. Use (?:...) to make that inner group non-capturing instead:

    split_p = r'((?:\w\.){2,})'
    

    Demo:

    >>> import re
    >>> split_p = r'((?:\w\.){2,})'
    >>> string = '(N.B.--I'
    >>> re.split(split_p, string)
    ['(', 'N.B.', '--I']
    
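    To see where the stray 'B.' came from in the first place, here is a quick sketch of the original pattern (my own illustration, not part of the question): the outer group captures the full 'N.B.', the repeated inner group only keeps its last repetition, and re.split() returns both:

    >>> m = re.search(r'((\w\.){2,})', '(N.B.--I')
    >>> m.group(1), m.group(2)   # outer group vs. last repetition of the inner group
    ('N.B.', 'B.')
    >>> re.split(r'((\w\.){2,})', '(N.B.--I')   # both groups end up in the split result
    ['(', 'N.B.', 'B.', '--I']
    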

    Next, if you want to split on repeating dashes, just add an alternative pattern with |:

    split_p = r'((?:\w\.){2,}|-+)'
    

    Demo:

    >>> split_p = r'((?:\w\.){2,}|-+)'
    >>> re.split(split_p, string)
    ['(', 'N.B.', '', '--', 'I']
    

    This gives an empty string in between because there are zero characters between the N.B. split point and the -- split point; you'd have to filter those out again.
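
    One simple way to do that filtering (a minimal sketch, keeping only non-empty tokens) is a list comprehension over the split result:

    >>> split_p = r'((?:\w\.){2,}|-+)'
    >>> [token for token in re.split(split_p, string) if token]
    ['(', 'N.B.', '--', 'I']
    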