Using regex, I would like to extract certain parts from an Emacs org-mode file, which is a simple text file. Entries in these org files start with * and sometimes have properties. A brief example can be found below:
import re
orgfiletest = """
* headline 0
* headline 1
:PROPERTIES:
:KEY: lala
:END:
* headline 2
* headline 3
:PROPERTIES:
:KEY: lblb
:END:
"""
I would like to extract all entries that do have properties; the extracted entries should include these properties. So, I would like to receive the following pieces of text:
* headline 1
:PROPERTIES:
:KEY: lala
:END:
and
* headline 3
:PROPERTIES:
:KEY: lblb
:END:
I started with something like this:
re.findall(r"\*.*\s:END:", orgfiletest, re.DOTALL)
But this also includes headline 0 and headline 2, which do not have any properties. My next attempt was to use lookarounds, but to no avail. Any help is much appreciated!
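A minimal sketch of why this first attempt overshoots (the names greedy and lazy are only illustrative): with re.DOTALL the greedy .* runs on to the last :END: in the file, and even a lazy .*? still starts at the first asterisk it finds, so a headline without properties gets pulled in either way.

```python
import re

orgfiletest = """
* headline 0
* headline 1
:PROPERTIES:
:KEY: lala
:END:
* headline 2
* headline 3
:PROPERTIES:
:KEY: lblb
:END:
"""

# Greedy: a single match running from the first "*" to the last ":END:".
greedy = re.findall(r"\*.*\s:END:", orgfiletest, re.DOTALL)

# Lazy: two matches, but each still begins at a headline without properties.
lazy = re.findall(r"\*.*?\s:END:", orgfiletest, re.DOTALL)

print(len(greedy), greedy[0].splitlines()[0])   # 1 * headline 0
print([m.splitlines()[0] for m in lazy])        # ['* headline 0', '* headline 2']
```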
Update / Solution that works for me:
Thanks to everyone who helped me find a solution! For future reference, I have included an updated MWE and the regex that works for me:
import re
orgfiletest = """
* headline 0
more text
* headline 1
:PROPERTIES:
:KEY: lala
:END:
* headline foo 2
** bar 3
:PROPERTIES:
:KEY: lblb
:FOOBAR: lblb
:END:
* new headline
more text
"""
re.findall(r"^\*+ .+[\r\n](?:(?!\*)\s*:.+[\r\n]?)+", orgfiletest, re.MULTILINE)
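As a quick check of that pattern, here is a self-contained sketch. It assumes the sample is stored without leading indentation, exactly as shown above; the names pattern and entries are mine and purely illustrative.

```python
import re

orgfiletest = """
* headline 0
more text
* headline 1
:PROPERTIES:
:KEY: lala
:END:
* headline foo 2
** bar 3
:PROPERTIES:
:KEY: lblb
:FOOBAR: lblb
:END:
* new headline
more text
"""

# A headline line, then one or more lines that do not start with "*"
# and whose first non-whitespace character is ":".
pattern = re.compile(r"^\*+ .+[\r\n](?:(?!\*)\s*:.+[\r\n]?)+", re.MULTILINE)

# Strip the trailing newline that each match carries.
entries = [m.rstrip() for m in pattern.findall(orgfiletest)]

for entry in entries:
    print(entry)
    print("---")
```

Only the "* headline 1" and "** bar 3" entries come back. The \s* in front of the colon is what lets the pattern also accept indented property lines.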
There are a couple of possibilities including non-regex solutions.
As you have specifically asked for one though:
^\*\ headline\ \d+[\r\n]   # look for "* headline", digit(s) and a newline
(?:(?!\*).+[\r\n]?)+       # followed by a line that does NOT start with an asterisk
                           # ... consisting of anything else, plus an optional newline
                           # ... at least one such line
See a demo on regex101.com (and mind the modifiers x and m!).
In Python this would be:
import re
rx = re.compile(r'''
^\*\ headline\ \d+[\r\n]
(?:(?!\*).+[\r\n]?)+
''', re.VERBOSE | re.MULTILINE)
print(rx.findall(orgfiletest))
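For anyone who wants to run this in isolation, here is a self-contained sketch against the sample from the question. The names sample and headlines are illustrative, and the sample is assumed to have no leading indentation.

```python
import re

sample = """
* headline 0
* headline 1
:PROPERTIES:
:KEY: lala
:END:
* headline 2
* headline 3
:PROPERTIES:
:KEY: lblb
:END:
"""

rx = re.compile(r'''
    ^\*\ headline\ \d+[\r\n]   # a headline line
    (?:(?!\*).+[\r\n]?)+       # one or more following lines not starting with an asterisk
''', re.VERBOSE | re.MULTILINE)

# Keep only the first line of each match to see which entries were picked up.
headlines = [m.group(0).splitlines()[0] for m in rx.finditer(sample)]
print(headlines)  # ['* headline 1', '* headline 3']
```

Headlines 0 and 2 drop out because the line after each of them starts with an asterisk, so the negative lookahead leaves the group with nothing to match and the required "at least once" fails.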
For completeness, a non-regex approach (using itertools):
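from itertools import groupby

result = {}
key = None
# Group consecutive lines by whether they look like a headline
for k, v in groupby(
        orgfiletest.split("\n"),
        lambda line: line.startswith('* headline')):
    if k:
        # Run of headlines: only the last one can own the lines that follow
        item = list(v)
        key = item[-1]
    elif key is not None:
        # Run of non-headline lines: attach them to the preceding headline
        result[key] = list(v)
print(result)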
# {'* headline 1': [' :PROPERTIES:', ' :KEY: lala', ' :END:'], '* headline 3': [' :PROPERTIES:', ' :KEY: lblb', ' :END:', '']}
This has the downside that lines starting with e.g. * headline abc or * headliner*** would be used as well. To be honest, I'd go for the regex solution here.
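If a non-regex route is still preferred, a plain line-by-line pass sidesteps the prefix problem. The following is only a rough sketch: entries_with_properties is a hypothetical helper, not part of any org-mode library, and it assumes the property drawer directly follows its headline (planning lines such as SCHEDULED in between are not handled).

```python
import re

HEADLINE = re.compile(r"\*+ ")  # one or more stars followed by a space


def entries_with_properties(text):
    """Return each headline together with its property drawer (hypothetical helper)."""
    entries = []
    headline = None   # most recent headline still eligible for a drawer
    current = None    # lines of the entry currently being collected
    for line in text.splitlines():
        stripped = line.strip()
        if HEADLINE.match(line):
            # A new headline discards any unfinished drawer.
            headline, current = line, None
        elif current is not None:
            current.append(line)
            if stripped == ":END:":
                entries.append("\n".join(current))
                headline, current = None, None
        elif stripped == ":PROPERTIES:" and headline is not None:
            current = [headline, line]
        else:
            # Assumption: a drawer must directly follow its headline.
            headline = None
    return entries


sample = """
* headline 0
more text
* headline 1
:PROPERTIES:
:KEY: lala
:END:
* headline foo 2
** bar 3
:PROPERTIES:
:KEY: lblb
:FOOBAR: lblb
:END:
* new headline
more text
"""

result = entries_with_properties(sample)
for entry in result:
    print(entry)
    print("---")
```

It is noticeably longer than the regex, which is part of why the regex remains the more compact choice for this task.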