Regular expression that takes <...> as one item in "foo bar <hello world> and so on" (Goal: Simple music/lilypond parsing)

I am using the re module in Python 3 and want to use re.sub(regex, replace, string) to turn the first string below into the second:

"foo <bar e word> f ga <foo b>"

"#foo <bar e word> #f #ga <foo b>"

or even

"#foo #<bar e word> #f #ga #<foo b>"

But I can't find a pattern that matches single words while ignoring the word boundaries inside a <...> construct.

Help would be nice!

P.S. 1

The whole story is a musical one: I have strings in the Lilypond format (or rather, a subset of the very simple core format, just notes and durations) and want to convert them to Python pairs of the form (int(duration), list of pitch strings). Performance is not important, so I can convert them back and forth, iterate over Python lists, split strings and join them again, etc. But I did not find an answer to the problem above.

Source String

"c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"

should result in

[
(4, ["c'"]),
(8, ["d"]),
(16, ["e'", "g'"]),
(4, ["fis'"]),
(0, ["a,,"]),
(0, ["g", "b'"]),
(1, ["c''"]),
]

The basic format is String+Number, like so: e4 bes16

The string consists of one or more [a-zA-Z] characters.
The string is followed by zero or more digits: e bes g4 c16
The string is followed by zero or more ' or , (not combined): e' bes, f'''2 g,,4
The string can be replaced by a list of strings delimited by <>, e.g. <...>4. The number comes directly after the >, with no space allowed.
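
The rules above translate fairly directly into a pattern for a single note token. The following is only a sketch under that reading of the rules; the names NOTE and parse_note are made up for illustration, and <...> lists are deliberately not handled here.

```python
import re

# One or more letters, then an optional run of ' OR an optional run of ,
# (never mixed), then an optional duration made of digits.
NOTE = re.compile(r"([a-zA-Z]+(?:'+|,+)?)(\d*)")

def parse_note(token):
    """Return (duration, [pitch]) for a single note token.

    A missing duration is reported as 0, as in the expected result above.
    """
    match = NOTE.fullmatch(token)
    if match is None:
        raise ValueError("not a valid note token: %r" % token)
    pitch, duration = match.groups()
    return (int(duration) if duration else 0, [pitch])

print(parse_note("f'''2"))  # (2, ["f'''"])
print(parse_note("bes,"))   # (0, ['bes,'])
```

Using fullmatch rather than search means that a token mixing both marks, such as e',, is rejected instead of being silently truncated.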

P.S. 2

The goal is NOT to create a Lilypond parser. It is really just for very short snippets with no additional functionality and no extensions to insert notes. If this does not work, I would go for another (simplified) format like ABC. So anything that has to do with Lilypond ("Run it through lilypond, let it give out the music data in Scheme, parse that") or its toolchain is certainly NOT the answer to this question. The package is not even installed.

Solution

Your first question can be answered in this way:

>>> import re
>>> t = "foo <bar e word> f ga <foo b>"
>>> t2 = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", t).lstrip()
>>> t2
'#foo #<bar e word> #f #ga #<foo b>'

I added lstrip() to remove the single leading space that the substitution leaves at the start of the result. If you want to go with your first option, you could simply replace #< with <.
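
Both variants can be wrapped in one small helper. This is a minimal sketch built on the pattern above; the function name mark_words and its mark_groups flag are hypothetical, chosen only for this example.

```python
import re

def mark_words(text, mark_groups=True):
    """Prefix every top-level item with '#'.

    A <...> group counts as one item. With mark_groups=False the groups
    themselves are left unmarked (the first option in the question).
    """
    # Match the start of the string or a run of whitespace, but only when
    # no '>' follows before the next '<' or '>' (i.e. we are outside <...>).
    marked = re.sub(r"(^|\s+)(?![^<>]*?>)", " #", text).lstrip()
    if not mark_groups:
        marked = marked.replace("#<", "<")
    return marked

t = "foo <bar e word> f ga <foo b>"
print(mark_words(t))                     # #foo #<bar e word> #f #ga #<foo b>
print(mark_words(t, mark_groups=False))  # #foo <bar e word> #f #ga <foo b>
```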

Your second question can be solved in the following manner, although you might need to think about the , in a list like ['g,', "b'"]. Should the comma from your string be there or not? There may be a faster way; the following is merely a proof of concept. A list comprehension might take the place of the final loop, although it would be fairly complicated.

>>> s = "c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"
>>> q2 = re.compile(r"(?:<)\s*[^>]*\s*(?:>)\d*|(?<!<)[^\d\s<>]+\d+|(?<!<)[^\d\s<>]+")
>>> s2 = q2.findall(s)
>>> s3 = [re.sub(r"\s*[><]\s*", '', x) for x in s2]
>>> s4 = [y.split() if ' ' in y else y for y in s3]
>>> s4
["c'4", 'd8', ["e'", "g'16"], "fis'4", 'a,,', ['g,', "b'"], "c''1"]
>>> q3 = re.compile(r"([^\d]+)(\d*)")
>>> result = []
>>> for item in s4:
...     if isinstance(item, list):
...         # A chord: keep every pitch; the duration trails the last element.
...         pitches = [q3.search(elem).group(1) for elem in item]
...         num = q3.search(item[-1]).group(2)
...     else:
...         # A single note: split it into pitch and optional duration.
...         pitches = [q3.search(item).group(1)]
...         num = q3.search(item).group(2)
...     # A missing duration becomes 0; otherwise convert it to int.
...     result.append((int(num) if num else 0, pitches))
...
>>> result
[(4, ["c'"]), (8, ['d']), (16, ["e'", "g'"]), (4, ["fis'"]), (0, ['a,,']), (0, ['g,', "b'"]), (1, ["c''"])]
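
As a more compact alternative to the step-by-step session, the whole conversion can be done in one pass with a single alternation pattern and re.finditer. This is only a sketch under the format rules stated in the question; TOKEN and parse_snippet are illustrative names, and the commas are kept as part of the pitch (so 'g,' rather than 'g'). It converts the durations to int, as the question asks.

```python
import re

# Either a <...> chord with an optional trailing duration,
# or a single pitch with an optional trailing duration.
TOKEN = re.compile(r"<([^<>]*)>(\d*)|([^\d\s<>]+)(\d*)")

def parse_snippet(source):
    """Convert a simple Lilypond-like string to (duration, pitches) pairs."""
    result = []
    for match in TOKEN.finditer(source):
        chord, chord_dur, pitch, pitch_dur = match.groups()
        if chord is not None:
            # split() without arguments also discards the padding in "< e' g' >".
            pitches, duration = chord.split(), chord_dur
        else:
            pitches, duration = [pitch], pitch_dur
        # A missing duration is reported as 0.
        result.append((int(duration) if duration else 0, pitches))
    return result

for pair in parse_snippet("c'4 d8 < e' g' >16 fis'4 a,, <g, b'> c''1"):
    print(pair)
```

If the commas should be dropped after all, that could be handled in one place, for example by post-processing the pitches list before it is appended.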