Python multiline regex with org-mode files


Using regex, I would like to extract certain parts of an Emacs org-mode file, which is a plain text file. Entries in these files start with * and some entries have properties. A brief example:

import re

orgfiletest = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

I would like to extract every entry that has properties, and each extracted entry should include those properties. So I expect to get the following pieces of text:

* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:

and

* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:

I started with something like this:

re.findall(r"\*.*\s:END:", orgfiletest, re.DOTALL)

But this also includes headline 0 and headline 2, which do not have any properties. My next attempt was to use lookarounds, but to no avail. Any help is much appreciated!
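
To make the problem concrete, here is a minimal sketch run against the sample above. With re.DOTALL the greedy .* runs from the first asterisk to the last :END:, so findall returns one big match. The non-greedy variant is only an extra illustration, not something from the original attempt, and the names greedy and lazy are made up for this example.

```python
import re

orgfiletest = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
* headline 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

# Greedy: with DOTALL, .* spans newlines, so one match covers
# everything from the first "*" to the last ":END:".
greedy = re.findall(r"\*.*\s:END:", orgfiletest, re.DOTALL)

# Non-greedy (illustrative variant): two matches, but each one still
# starts at the nearest preceding "*", i.e. at a headline without properties.
lazy = re.findall(r"\*.*?\s:END:", orgfiletest, re.DOTALL)

print(len(greedy), greedy[0].splitlines()[0])
print([m.splitlines()[0] for m in lazy])
```

Either way the match begins at headline 0 or headline 2, so the pattern has to anchor on the headline line itself and then check what follows it.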

Update / Solution that works for me:

Thanks to everyone who helped me find a solution! For future reference, I have included an updated MWE and the regex that works for me:

import re
orgfiletest = """
* headline 0
  more text 
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline foo 2
** bar 3
  :PROPERTIES:
  :KEY: lblb
  :FOOBAR: lblb
  :END:
* new headline
  more text
"""

re.findall(r"^\*+ .+[\r\n](?:(?!\*)\s*:.+[\r\n]?)+", orgfiletest, re.MULTILINE)
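
As a quick check, the sketch below runs this pattern on the updated sample and prints each match. The names pattern and entries are only for illustration.

```python
import re

orgfiletest = """
* headline 0
  more text
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline foo 2
** bar 3
  :PROPERTIES:
  :KEY: lblb
  :FOOBAR: lblb
  :END:
* new headline
  more text
"""

# ^\*+ .+[\r\n]   a headline of any level, up to and including its newline
# (?: ... )+      one or more following lines that do not start with "*"
#                 and whose first non-blank character is ":"
pattern = re.compile(r"^\*+ .+[\r\n](?:(?!\*)\s*:.+[\r\n]?)+", re.MULTILINE)

entries = pattern.findall(orgfiletest)
for entry in entries:
    print(entry)
```

This yields two entries, one for "* headline 1" and one for "** bar 3". Note that the pattern accepts any run of lines beginning with a colon after the headline; it does not check for :PROPERTIES: or :END: specifically.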

Solution

  • There are several possibilities, including non-regex ones.
    Since you specifically asked for a regex, though:

    ^\*\ headline\ \d+[\r\n] # look for "* headline", digit(s) and a newline
    (?:(?!\*).+[\r\n]?)+     # followed by a line that does NOT start with an asterisk
                             # ... consumed up to and including its newline
                             # ... at least once
    

    See a demo on regex101.com (and mind the modifiers x and m, i.e. verbose and multiline!)


    In Python this would be:

    import re
    
    rx = re.compile(r'''
                ^\*\ headline\ \d+[\r\n] 
                (?:(?!\*).+[\r\n]?)+
                ''', re.VERBOSE | re.MULTILINE)
    
    print(rx.findall(orgfiletest))
    


    A non-regex way could be (using itertools):

    from itertools import groupby
    
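    result = {}
    key = None
    # groupby yields alternating runs of headline and non-headline lines
    for k, v in groupby(
            orgfiletest.split("\n"),
            lambda line: line.startswith('* headline')):
        if k:
            # only the last headline of a run owns the lines that follow it
            key = list(v)[-1]
        elif key is not None:
            result[key] = list(v)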
    
    print(result)
    # {'* headline 1': ['  :PROPERTIES:', '  :KEY: lala', '  :END:'], '* headline 3': ['  :PROPERTIES:', '  :KEY: lblb', '  :END:', '']}
    

    This has the downside that lines starting with e.g. * headline abc or * headliner*** would be treated as headlines as well. To be honest, I'd go for the regex solution here.
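
If the hard-coded '* headline' prefix is the main concern, a plain loop is another non-regex option. What follows is a minimal sketch, not a full org parser. It assumes the property drawer opens with :PROPERTIES: on the line directly after the headline and closes with :END:. The function name entries_with_properties is made up for this example.

```python
def entries_with_properties(text):
    """Return headline plus property drawer for each entry that has a drawer."""
    lines = text.splitlines()
    entries = []
    i = 0
    while i < len(lines):
        line = lines[i]
        # A headline is one or more asterisks followed by a space.
        is_headline = line.startswith("*") and line.lstrip("*").startswith(" ")
        # Assumption: the drawer starts on the very next line.
        has_drawer = (
            is_headline
            and i + 1 < len(lines)
            and lines[i + 1].strip() == ":PROPERTIES:"
        )
        if has_drawer:
            block = [line]
            i += 1
            # Collect drawer lines up to and including :END:
            while i < len(lines):
                block.append(lines[i])
                if lines[i].strip() == ":END:":
                    break
                i += 1
            entries.append("\n".join(block))
        i += 1
    return entries


sample = """
* headline 0
* headline 1
  :PROPERTIES:
  :KEY: lala
  :END:
* headline 2
** bar 3
  :PROPERTIES:
  :KEY: lblb
  :END:
"""

found = entries_with_properties(sample)
for entry in found:
    print(entry)
    print("---")
```

Unlike the groupby version, this accepts headlines of any level and any title, at the cost of a few more lines of code.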