python-3.x pyparsing

How can I split text using pyparsing with a specific token?


PLEASE NOTE: The question "Splitting text into lines with pyparsing" is about parsing a file using a single token at the end of each line (\n), which is fairly easy. My question is different: I am having a hard time excluding the last word before a : (such as port) from the free text that is entered before the filters.


Our API receives user input such as some free text port:45 title:welcome to our website, and after parsing I need two parts -> [some free text, port:45 title:welcome]

from pyparsing import *
token = "some free text port:45 title:welcome to our website"
t = Word(alphas, " "+alphanums) + Word(" "+alphas,":"+alphanums)
t.parseString(token)  # raises ParseException

This gives me an error:

pyparsing.ParseException: Expected W:( ABC..., :ABC...), found ':'  (at char 21), (line:1, col:22)

This happens because the first Word consumes everything up to and including some free text port, leaving :45 title:welcome to our website for the second Word, which cannot start with a :.
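
The greedy match can be reproduced in isolation. This is a minimal sketch that reuses only the first Word from the expression in the question; the names text, greedy and first are illustrative.

```python
import pyparsing as pp

text = "some free text port:45 title:welcome to our website"

# Same first expression as in the question: a Word whose body
# characters include a space.
greedy = pp.Word(pp.alphas, " " + pp.alphanums)

# Word keeps consuming every allowed character, so it runs past
# "text", swallows "port" as well, and stops only at the ":".
first = greedy.parseString(text)[0]
print(repr(first))  # prints 'some free text port'
```

Because "port" has already been consumed, nothing that follows can recognise port:45 as a filter.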

How can I get all data before port: in a separate group and port:.... in another group using pyparsing?


Solution

  • Adding " " as one of the valid characters in a Word pretty much always has this problem, and so is in general a pyparsing anti-pattern. Word does its character repetition matching inside its parse() method, so there is no way to add any kind of lookahead.

    To get spaces in your expressions, you will probably need a OneOrMore, wrapped in originalTextFor, like this:

    import pyparsing as pp
    
    word = pp.Word(pp.printables, excludeChars=":")
    
    non_tag = word + ~pp.FollowedBy(":")
    
    # tagged value is two words with a ":"
    tag = pp.Group(word + ":" + word)
    
    # one or more non-tag words - use originalTextFor to get back 
    # a single string, including intervening white space
    phrase = pp.originalTextFor(non_tag[1, ...])
    
    parser = (phrase | tag)[...]
    
    parser.runTests("""\
        some free text port:45 title:welcome to our website
        """)
    

    Prints:

    some free text port:45 title:welcome to our website
    ['some free text', ['port', ':', '45'], ['title', ':', 'welcome'], 'to our website']
    [0]:
      some free text
    [1]:
      ['port', ':', '45']
    [2]:
      ['title', ':', 'welcome']
    [3]:
      to our website
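
To get closer to the two parts asked for in the question, the mixed result list can be post-processed. The following is a sketch under a few assumptions: split_query is a hypothetical helper name, free text appearing after the filters is kept as its own phrase rather than discarded, and a repeated key simply overwrites the earlier value. The grammar is the same as above.

```python
import pyparsing as pp

# Same grammar as in the answer above.
word = pp.Word(pp.printables, excludeChars=":")
non_tag = word + ~pp.FollowedBy(":")
tag = pp.Group(word + ":" + word)
phrase = pp.originalTextFor(non_tag[1, ...])
parser = (phrase | tag)[...]


def split_query(text):
    """Hypothetical helper: return (free-text phrases, filters dict)."""
    free_text = []
    filters = {}
    for item in parser.parseString(text):
        if isinstance(item, str):
            # A phrase comes back from originalTextFor as a plain string.
            free_text.append(item)
        else:
            # A tag is a group of three tokens: key, ":", value.
            filters[item[0]] = item[2]
    return free_text, filters


free_text, filters = split_query(
    "some free text port:45 title:welcome to our website"
)
print(free_text)  # ['some free text', 'to our website']
print(filters)    # {'port': '45', 'title': 'welcome'}
```

If only the leading free text is wanted, taking free_text[0] when the input starts with a phrase would be one way to drop the trailing words.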