I'm parsing a huge file in Python which, among other things, contains lines like this:
"Value names: 0 = Something, 1 = Something, else, 4 = A value, 5 = BAD - enough, 6 GOOD, 7 = Ugly,"
My goal is to put them into a dictionary.
The problem is that they are not written consistently at all: some of them are missing commas, equal signs, etc., and the names themselves may contain commas and other punctuation. The numbers (keys) may also have several digits. The only thing I'm fairly sure of is that there are no digits in the names.
Because of this, I thought I would give Regexes a go, but it turned out to be a bit more difficult than expected.
What I've tried is this.
import re

s = "Value names: 0 = Something, 1 = Something, else, 4 = A value 5 = BAD, 6 GOOD, 7 = Ugly,"
pattern = re.compile(r'([0-9]+)([A-Z]+)')
for (numbers, letters) in re.findall(pattern, s):
    ...
However, this (perhaps obviously to the trained eye) does not work.
As I am not very well versed in Regexes, I would really appreciate some help with this, as I am not able to edit the file directly, and all manual parsing tricks seem to fall short.
You can use
(?P<keys>\d+)\s*(?:=\s*)?(?P<values>.*?)(?=\s*(?:,\s*)?(?:\d|\Z))
See the regex demo. Details:
- (?P<keys>\d+) - Group "keys": one or more digits
- \s* - zero or more whitespaces
- (?:=\s*)? - an optional sequence of = and zero or more whitespaces
- (?P<values>.*?) - Group "values": any zero or more chars other than line break chars, as few as possible
- (?=\s*(?:,\s*)?(?:\d|\Z)) - immediately to the right of the current location, there must be
  - \s* - zero or more whitespaces
  - (?:,\s*)? - an optional sequence of , and zero or more whitespaces
  - (?:\d|\Z) - either a digit or end of string.
See a Python demo:
import re

text = "Value names: 0 = Something, 1 = Something, else, 4 = A value, 5 = BAD - enough, 6 GOOD, 7 = Ugly,"
pattern = r"(?P<keys>\d+)\s*(?:=\s*)?(?P<values>.*?)(?=\s*(?:,\s*)?(?:\d|\Z))"
for match in re.finditer(pattern, text):
    print(match.groupdict())
Output:
{'keys': '0', 'values': 'Something'}
{'keys': '1', 'values': 'Something, else'}
{'keys': '4', 'values': 'A value'}
{'keys': '5', 'values': 'BAD - enough'}
{'keys': '6', 'values': 'GOOD'}
{'keys': '7', 'values': 'Ugly'}
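Since the goal in the question is a dictionary, the matches can be collected with a dict comprehension. This is only a minimal sketch: the names `PAIR_RE` and `parse_value_names` are made up for illustration, and it assumes (as the question states) that the names contain no digits. If a key appears twice on a line, the later value overwrites the earlier one.

```python
import re

# Same pattern as above, compiled once so it can be reused for every line.
PAIR_RE = re.compile(
    r"(?P<keys>\d+)\s*(?:=\s*)?(?P<values>.*?)(?=\s*(?:,\s*)?(?:\d|\Z))"
)

def parse_value_names(line):
    """Return {int key: name} for one 'Value names: ...' line (hypothetical helper)."""
    return {int(m.group("keys")): m.group("values") for m in PAIR_RE.finditer(line)}

text = "Value names: 0 = Something, 1 = Something, else, 4 = A value, 5 = BAD - enough, 6 GOOD, 7 = Ugly,"
names = parse_value_names(text)
print(names)
# {0: 'Something', 1: 'Something, else', 4: 'A value', 5: 'BAD - enough', 6: 'GOOD', 7: 'Ugly'}
```

The keys are converted with int() here; drop that call if you would rather keep them as strings. Note that any digit elsewhere on the line, for example in a prefix before the first pair, would also start a match, so it may be worth slicing off everything up to "Value names:" first.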