I am trying to write a regex to read lines such as:
* DCH : 0.80000000 *
* PYR : 100.00000000 *
* Bond ( 1, 0) : 0.80000000 *
* Angle ( 1, 0, 2) : 100.00000000 *
To that end, I wrote the following regex. It works, but I would like some feedback on how to get the integers in parentheses. Only lines 3 and 4 above contain that part (a kind of tuple of integers), so it has to be optional.
I had to define several groups to make that tuple of integers optional and to handle the fact that it may contain 2, 3 or 4 integers.
In [64]: coord_patt = re.compile(r"\s+(\w+)\s+(\(((\s*\d+),?){2,4}\))?\s+:\s+(\d+.\d+)")
In [65]: line2 = "* Angle ( 1, 0, 2) : 100.00000000 *"
In [66]: m = coord_patt.search(line2)
In [67]: m.groups()
Out[67]: ('Angle', '( 1, 0, 2)', ' 2', ' 2', '100.00000000')
Another example:
In [68]: line = " * Bond ( 1, 0) : 0.80000000 *"
In [69]: m = coord_patt.search(line)
In [71]: m.groups()
Out[71]: ('Bond', '( 1, 0)', ' 0', ' 0', '0.80000000')
As you can see, it works, but I do not understand why the groups contain only the last integer rather than each integer separately. Is there a way to get those integers individually, or to avoid defining all those groups and capture only group 2, the string form of the tuple, which can easily be parsed afterwards?
As indicated in "Capturing repeating subpatterns in Python regex", the re module doesn't support repeated captures, but the third-party regex module does.
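To make that limitation concrete, here is a minimal sketch using only the tuple part of the question's pattern and the sample text "( 1, 0, 2)". The variable names are illustrative, not from the original post. A capturing group inside a repetition is overwritten on every iteration, so re reports only the final one.

```python
import re

# Only the tuple part of the pattern, applied to sample text from the question.
m = re.search(r"\((?:\s*(\d+),?){2,4}\)", "( 1, 0, 2)")

whole_match = m.group(0)   # the full parenthesised text
last_capture = m.group(1)  # only the final repetition survives
print(whole_match, last_capture)
```

The group matched three times (1, 0, then 2), but only '2' is kept, which is why the question's output shows ' 2' and ' 0' rather than every integer.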
Here are two solutions: one based on regex, the other on re combined with a safe evaluation of the tuple when one is encountered.
txt = r"""* DCH : 0.80000000 *
* PYR : 100.00000000 *
* Bond ( 1, 0) : 0.80000000 *
* Angle ( 1, 0, 2) : 100.00000000 *
"""
Using regex:
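import regex

# \s* around the optional tuple so single-spaced lines also match;
# the decimal point is escaped so it only matches a literal '.'.
p = regex.compile(r'\s+(\w+)\s*(?:\((?:\s*(\d+),?){2,4}\))?\s*:\s+(\d+\.\d+)')
for s in txt.splitlines():
    if m := p.search(s):
        name = m.group(1)
        # captures(2) returns every repetition of group 2, or [] if there is no tuple
        tup = tuple(int(k) for k in m.captures(2) if k.isnumeric())
        val = float(m.group(3))
        print(f'{name!r}\t{tup!r}\t{val!r}')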
Prints:
'DCH' () 0.8
'PYR' () 100.0
'Bond' (1, 0) 0.8
'Angle' (1, 0, 2) 100.0
Using re:
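import re
import ast

# \s* around the optional tuple so single-spaced lines also match;
# the decimal point is escaped so it only matches a literal '.'.
p = re.compile(r'\s+(\w+)\s*(\((?:\s*\d+,?){2,4}\))?\s*:\s+(\d+\.\d+)')
for s in txt.splitlines():
    if m := p.search(s):
        name, tup, val = m.groups()
        # group 2 is the tuple as text, e.g. '( 1, 0)', or None when absent
        tup = ast.literal_eval(tup) if tup is not None else ()
        val = float(val)
        print(f'{name!r}\t{tup!r}\t{val!r}')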
Prints:
'DCH' () 0.8
'PYR' () 100.0
'Bond' (1, 0) 0.8
'Angle' (1, 0, 2) 100.0
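A third option, sketched here as an assumption rather than as part of the original answer, stays within re and skips ast: capture the tuple as one string, then pull the integers out with re.findall. The helper name parse_coord_line and the constant COORD are hypothetical, and the pattern is a relaxed variant that uses \s* so it also accepts single-spaced lines.

```python
import re

# Hypothetical names; the pattern is adapted from the answer, with \s* in
# place of \s+ and the decimal point escaped.
COORD = re.compile(r"(\w+)\s*(\((?:\s*\d+,?){2,4}\))?\s*:\s*(\d+\.\d+)")

def parse_coord_line(line):
    """Return (name, indices, value), or None if the line does not match."""
    m = COORD.search(line)
    if m is None:
        return None
    name, tup, val = m.groups()
    # tup is e.g. '( 1, 0, 2)' or None; findall collects every integer in it.
    indices = tuple(int(k) for k in re.findall(r"\d+", tup or ""))
    return name, indices, float(val)

for sample in ("* DCH : 0.80000000 *", "* Angle ( 1, 0, 2) : 100.00000000 *"):
    print(parse_coord_line(sample))
```

This keeps a single optional group for the tuple, which is close to what the question proposed: capture group 2 as a string and read the integers from it afterwards.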