I have the following text chunk:
string = """
apples: 20
oranges: 30
ripe: yes
farmers:
	elmer fudd
		lives in tv
	farmer ted
		lives close
	farmer bill
		lives far
selling: yes
veggies:
	carrots
	potatoes
"""
I am trying to find a good regex that will let me parse out the key/value pairs. I can grab the single-line ones with something like:
r'(.+?):\s(.+?)\n'
However, the problem comes when I hit farmers or veggies, whose values span several lines.
Using the re flags, I think I need to do something like:
re.findall(r'(.+?):\s(.+?)\n', string, re.S)
However, I am having a heck of a time grabbing all of the values associated with farmers.
There is a newline after each value, and a tab or series of tabs before the values when they are multiline.
My goal is to have something like:
{ 'apples': 20, 'farmers': ['elmer fudd', 'farmer ted'] }
etc.
Thank you in advance for your help.
Here's a really dumb parser that takes into account your (apparent) indentation rules:
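def parse(s):
    d = {}
    lastkey = lastval = lastindent = None
    for fullline in s:
        line = fullline.strip()
        if not line:
            # Skip blank lines.
            pass
        elif ':' not in line:
            # Continuation line: belongs to the most recent "key:" header.
            if lastkey is None:
                continue
            indent = len(fullline) - len(fullline.lstrip())
            if lastindent is None:
                # The first value sets the indentation level to collect.
                lastindent = indent
            if lastindent == indent:
                lastval.append(line)
        else:
            # Any line with a colon ends the previous multiline value.
            if lastkey:
                d[lastkey] = lastval
                lastkey = None
            if line.endswith(':'):
                # "key:" with nothing after it starts a multiline value.
                lastkey, lastval, lastindent = line[:-1], [], None
            else:
                key, _, value = line.partition(':')
                d[key] = value.strip()
    # Flush a multiline value that runs to the end of the input.
    if lastkey:
        d[lastkey] = lastval
    return d

from pprint import pprint
pprint(parse(string.splitlines()))
The output is:
{'apples': '20',
 'farmers': ['elmer fudd', 'farmer ted', 'farmer bill'],
 'oranges': '30',
 'ripe': 'yes',
 'selling': 'yes',
 'veggies': ['carrots', 'potatoes']}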
I think this is already complicated enough that it would look cleaner as an explicit state machine, but I wanted to write this in terms that any novice could understand.
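For what it's worth, here is a rough sketch of what that explicit state machine might look like. The names `State`, `TOP`, `IN_LIST`, and `parse_sm` are made up for illustration, and it assumes the same indentation rules as the parser above: only values at the indentation of the first value under a `key:` header are collected. It is not a drop-in replacement; for instance, it splits on the first colon rather than checking whether the line ends with one.

```python
from enum import Enum, auto


class State(Enum):
    TOP = auto()      # not inside a multiline value
    IN_LIST = auto()  # collecting indented values under a "key:" header


def parse_sm(lines):
    """Parse 'key: value' lines and indented value lists into a dict."""
    result = {}
    state = State.TOP
    values = None  # list being filled while in IN_LIST
    indent = None  # indentation of the first value under the current header

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines never change the state

        if ':' in line:
            # A line containing a colon always starts a new entry.
            key, _, value = line.partition(':')
            value = value.strip()
            if value:
                result[key] = value
                state = State.TOP
            else:
                # Bare "key:" header: start collecting a list.
                values = result[key] = []
                indent = None
                state = State.IN_LIST
        elif state is State.IN_LIST:
            depth = len(raw) - len(raw.lstrip())
            if indent is None:
                indent = depth  # first value fixes the level we collect
            if depth == indent:
                values.append(line)
            # Lines at any other depth are ignored.
        # Colon-free lines seen in the TOP state are ignored.

    return result
```

You would call it the same way, e.g. `parse_sm(string.splitlines())`. As with the version above, every value comes back as a string, so if you want `20` rather than `'20'` you would still need to convert it yourself (for example with `int()` on values where `str.isdigit()` is true).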