Python string tokenization - custom lexer?


I have a string like:

<number>xx<->a<T>b<F>c<F>d<F>e<F>f<F>g<T>h<F>i<F>

How can I efficiently parse this string so that, for example:

  • xx has a value of null
  • a has a value of 1
  • b has a value of 0

Solution

  • You can parse this with regular expressions. First remove the initial <word> at the start of the string, if it exists. Then look for word<code> pairs and save them as key/value pairs in a dictionary, using the codes dictionary to convert -, F, and T to null, 0, and 1.

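    import re
    
    s = '<number>xx<->a<T>b<F>c<F>d<F>e<F>f<F>g<T>h<F>i<F>'
    
    # Strip the optional leading <word> and keep it as the head.
    m = re.match(r'<(\w*?)>', s)
    if m:
        head = m.group(1)
        s = s[m.end():]
        print(head)
    else:
        print('No head group')
    
    # Translate the flag inside the angle brackets to its value.
    codes = {'-': 'null', 'F': '0', 'T': '1'}
    # Group 1 is the name, group 2 is the flag between < and >.
    pat = re.compile(r'(\w*?)<([-\w]*?)>')
    
    # findall returns (name, flag) tuples; an unknown flag raises KeyError.
    out = {k: codes[v] for k, v in pat.findall(s)}
    print(out)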
    

    Output:

    number
    {'xx': 'null', 'a': '1', 'b': '0', 'c': '0', 'd': '0', 'e': '0', 'f': '0', 'g': '1', 'h': '0', 'i': '0'}
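
If you also want to catch malformed input, the same idea can be written as a small scanner that walks the string one name<flag> pair at a time. This is only a sketch under some assumptions: the function name parse_flags is made up for illustration, the flags are mapped to Python's None, 0 and 1 instead of strings, and anything that is not a name<flag> pair is treated as an error.

```python
import re

# Assumed mapping: the flag characters come from the question; using
# None/0/1 instead of the strings 'null'/'0'/'1' is a choice for this sketch.
CODES = {'-': None, 'F': 0, 'T': 1}

HEAD = re.compile(r'<(\w+)>')          # optional leading <word>
PAIR = re.compile(r'(\w+)<([-\w]+)>')  # one name<flag> pair


def parse_flags(s):
    """Return (head, {name: value}); raise ValueError on malformed input.

    parse_flags is a hypothetical helper name, not part of any library.
    """
    head = None
    pos = 0
    m = HEAD.match(s)
    if m:
        head = m.group(1)
        pos = m.end()

    out = {}
    while pos < len(s):
        # Anchor each match at the current offset so no text is skipped.
        m = PAIR.match(s, pos)
        if not m:
            raise ValueError(f'unexpected text at offset {pos}: {s[pos:pos + 10]!r}')
        name, flag = m.groups()
        if flag not in CODES:
            raise ValueError(f'unknown flag {flag!r} for {name!r}')
        out[name] = CODES[flag]
        pos = m.end()
    return head, out


head, values = parse_flags('<number>xx<->a<T>b<F>c<F>d<F>e<F>f<F>g<T>h<F>i<F>')
print(head)
print(values)
```

Unlike findall, which silently skips text that does not match, matching at an explicit offset makes stray characters or an unknown flag fail loudly with a ValueError.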