
pyparsing: how to get token location?


I have a simple pyparsing grammar that matches numbers separated by spaces:

from pyparsing import *
NUMBER = Word( nums )
STATEMENT = ZeroOrMore( NUMBER )
print( STATEMENT.parseString( "1 2 34" ) )

Given the test string "1 2 34", it returns three strings, one per parsed token. But how do I find the location of each token in the original string? I need it for "kind of" syntax highlighting.


Solution

  • Add this parse action to NUMBER:

    NUMBER.setParseAction(lambda locn,tokens: (locn,tokens[0]))
    

    Parse actions can be passed the tokens that were parsed for a given expression, the location of the parse, and the original string. You can pass functions to setParseAction with any of these signatures:

    fn()
    fn(tokens)
    fn(locn,tokens)
    fn(srcstring,locn,tokens)
    

    For your purposes, you only need the location and the parsed tokens.

    After adding this parse action, your parsed results now look like:

    [(0, '1'), (2, '2'), (4, '34')]
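
If the end offset is needed as well, a three-argument parse action can compute it from the start location and the matched text. The following is a minimal sketch rather than the only way to do it: the helper name with_span is made up for illustration, and it assumes each NUMBER match is a single token with nothing skipped inside it.

```python
from pyparsing import Word, ZeroOrMore, nums


def with_span(srcstring, locn, tokens):
    # Hypothetical helper (not part of pyparsing).
    # locn is the offset where the match starts, after leading
    # whitespace has been skipped; tokens[0] is the matched text.
    text = tokens[0]
    return (locn, locn + len(text), text)


NUMBER = Word(nums)
NUMBER.setParseAction(with_span)
STATEMENT = ZeroOrMore(NUMBER)

# Each result is a (start, end, text) tuple, with end exclusive.
spans = STATEMENT.parseString("1 2 34").asList()
print(spans)  # [(0, 1, '1'), (2, 3, '2'), (4, 6, '34')]
```

The parse action is matched by the number of arguments it accepts, so the parameter names here are only descriptive.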
    

    EDIT:

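    Since my original answer to this post, I've added the locatedExpr helper to pyparsing, which gives both the starting and ending location for a particular expression. (In pyparsing 3.x, locatedExpr is marked as deprecated in favor of the Located class, but it is still available.) Now this can be written simply as: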

    NUMBER = locatedExpr(Word(nums))
    

    Here is the full script/output:

    >>> from pyparsing import *
    >>> NUMBER = locatedExpr(Word( nums ))
    >>> STATEMENT = ZeroOrMore( NUMBER )
    >>> print( STATEMENT.parseString( "1 2 34" ).dump() )
    
    [[0, '1', 1], [2, '2', 3], [4, '34', 6]]
    [0]:
      [0, '1', 1]
      - locn_end: 1
      - locn_start: 0
      - value: '1'
    [1]:
      [2, '2', 3]
      - locn_end: 3
      - locn_start: 2
      - value: '2'
    [2]:
      [4, '34', 6]
      - locn_end: 6
      - locn_start: 4
      - value: '34'
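
For the "kind of" syntax highlighting mentioned in the question, the start and end offsets can be used to wrap each token in markers. The sketch below is plain Python and does not call pyparsing at all; the function name highlight and the bracket markers are assumptions chosen for illustration, and the input is the list of [start, value, end] triples from the locatedExpr output above.

```python
def highlight(source, spans, open_mark="[", close_mark="]"):
    # Hypothetical helper (not part of pyparsing).
    # spans: iterable of (start, value, end) triples, end exclusive,
    # assumed not to overlap.
    pieces = []
    cursor = 0
    for start, _value, end in sorted(spans, key=lambda s: s[0]):
        pieces.append(source[cursor:start])  # untouched text before the token
        pieces.append(open_mark + source[start:end] + close_mark)
        cursor = end
    pieces.append(source[cursor:])  # trailing text after the last token
    return "".join(pieces)


spans = [[0, "1", 1], [2, "2", 3], [4, "34", 6]]
print(highlight("1 2 34", spans))  # [1] [2] [34]
```

Slicing the original string by offset, instead of re-joining the token values, keeps the whitespace between tokens exactly as it was typed.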