python regex nltk special-characters numeric

regex pattern include alpha, special, numeric

These are my example sentences:

this is first: example 234 -

this is second (example) 345 1

this is my third example (456) 3

expected output:

['this is first: example', 234, -]
['this is second (example)', 345, 1]
['this is my third example', (456), 3]

I tried Python's split(), the NLTK word and sentence tokenizers, and the following regex:

str1 = re.compile(r'([\w: ]+)|([0-9])')
str1.findall('my above examples')

Please suggest a module that can produce my expected output, or let me know where the mistake in my regex is.

Solution

With your expression you get separate matches because of the alternation: each alternative matches on its own. If you can expect three parts on every line, write an expression that matches the whole line and captures the three parts in separate groups. For example:

^(.*) ([\d()]+) ([-\d])

Note that this works because, although .* first matches the whole line, the engine backtracks and gives up characters so that the number groups at the end can match.
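The backtracking can be observed directly. The following is a minimal sketch using one line from the question; the variable names are illustrative only.

```python
import re

line = "this is second (example) 345 1"

# On its own, the greedy .* consumes the entire line.
whole = re.match(r"^(.*)", line).group(1)

# With the two trailing groups added, .* has to give characters back
# until " 345" and " 1" can be matched by the groups that follow it.
parts = re.match(r"^(.*) ([\d()]+) ([-\d])", line).groups()

print(whole)  # this is second (example) 345 1
print(parts)  # ('this is second (example)', '345', '1')
```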

In code:

import re

regex = r"^(.*) ([\d()]+) ([-\d])"
# your_text is the input string, one sentence per line
matches = re.findall(regex, your_text, re.MULTILINE)
print(matches)

Output:

[('this is first: example', '234', '-'), 
('this is second (example)', '345', '1'), 
('this is my third example', '(456)', '3')]
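For comparison, here is a sketch of what the pattern from the question returns on the first two sample lines. Because \w also matches digits, the first alternative swallows the numbers and the second group stays empty, while characters outside the class, such as the parentheses, split the line into separate matches. The variable names are illustrative only.

```python
import re

# The pattern from the question, written as a raw string.
original = re.compile(r"([\w: ]+)|([0-9])")

first = original.findall("this is first: example 234 -")
second = original.findall("this is second (example) 345 1")

print(first)   # [('this is first: example 234 ', '')]
print(second)  # [('this is second ', ''), ('example', ''), (' 345 1', '')]
```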

Edit

The pattern above works well if you know how many groups of numbers to expect at the end. If that number varies, however, you would need a fixed number of repeated optional number groups like (?:\d+)?, one for each value you anticipate. That is cumbersome and might still not cover every case that comes up.
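To illustrate the limitation, this sketch runs the three-group pattern on two longer lines (the same lines appear in the test data further down). The extra trailing numbers end up in the description, and the last group captures only a single character.

```python
import re

regex = r"^(.*) ([\d()]+) ([-\d])"

fourth = re.findall(regex, "this is the fourth example (456) 4 12")
fifth = re.findall(regex, "this is the fifth example 300 1 16 200 (2) 18")

# "12" and "18" are cut down to "1" because ([-\d]) matches one character
# and nothing anchors the pattern to the end of the line.
print(fourth)  # [('this is the fourth example (456)', '4', '1')]
print(fifth)   # [('this is the fifth example 300 1 16 200', '(2)', '1')]
```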

So a better option is to capture all the trailing numbers in one block and split it afterwards. To do that, match the beginning of the line with a lazy quantifier, so that it stops as early as possible and leaves all the number groups at the end of the line to a single capturing group. For example:

^(.*?)((?: [-\d()]+)+)$
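The effect of the lazy quantifier is easiest to see next to its greedy counterpart. In this sketch the only difference between the two patterns is .*? versus .*; the variable names are illustrative only.

```python
import re

line = "this is the fifth example 300 1 16 200 (2) 18"

# Lazy: the description stops as early as possible, so the second
# group collects every trailing token.
lazy = re.match(r"^(.*?)((?: [-\d()]+)+)$", line).groups()

# Greedy: the description grabs as much as it can and leaves only the
# last token for the second group.
greedy = re.match(r"^(.*)((?: [-\d()]+)+)$", line).groups()

print(lazy)    # ('this is the fifth example', ' 300 1 16 200 (2) 18')
print(greedy)  # ('this is the fifth example 300 1 16 200 (2)', ' 18')
```

Note that the captured block keeps its leading space, which is why splitting it on whitespace afterwards is convenient.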


Then we can split the captured block of numbers into a list and pair it with the description. Example code:

import re

test_str = (
    "this is first: example 234 -\n"
    "this is second (example) 345 1\n"
    "this is my third example (456) 3\n"
    "this is the fourth example (456) 4 12\n"
    "this is the fifth example 300 1 16 200 (2) 18")

regex = r"^(.*?)((?: [-\d()]+)+)$"
matches = re.findall(regex, test_str, re.MULTILINE)
captures = [(a, b.split()) for (a, b) in matches]

print(captures)

Output:

[
  ('this is first: example', ['234', '-']), 
  ('this is second (example)', ['345', '1']), 
  ('this is my third example', ['(456)', '3']), 
  ('this is the fourth example', ['(456)', '4', '12']), 
  ('this is the fifth example', ['300', '1', '16', '200', '(2)', '18'])
]
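If you want something closer to the flat lists in the question, with plain numbers as integers, a small post-processing step could look like the sketch below. The names LINE_RE and parse_line are hypothetical helpers, and converting only all-digit tokens to int (leaving "-" and "(456)" as strings) is an assumption about what the expected output means.

```python
import re

# Same pattern as above, compiled once (hypothetical name).
LINE_RE = re.compile(r"^(.*?)((?: [-\d()]+)+)$")

def parse_line(line):
    """Return [description, token, ...] for one line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    description, tail = m.groups()
    # Assumption: tokens made only of digits become int, everything else stays str.
    tokens = [int(t) if t.isdigit() else t for t in tail.split()]
    return [description, *tokens]

print(parse_line("this is first: example 234 -"))
# ['this is first: example', 234, '-']
print(parse_line("this is my third example (456) 3"))
# ['this is my third example', '(456)', 3]
```

Lines without trailing number tokens do not match the pattern, so parse_line returns None for them.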