Tags: python, parsing, text-processing, text-parsing, string-parsing

Python: parsing text of unknown length


I have a database full of strings such as:

as.web.product.viewed(AT)2018-01-28T19:00:52.032Z(THEN)as.web.product.viewed(AT)2018-01-28T19:02:20.132Z

(Another possible delimiter is "(WITH)", and another possible action is as.web.product.purchased, so ideally I need a solution that is as generic as possible.)

There can be any number of actions in a sequence, and in more or less any order. I need to be able to isolate the action name (such as as.web.product.viewed) and the time at which it happened, as well as maintain the order of the actions.

What would be the most Pythonic way of doing this?

EDIT: desired output (for the example above) - 2 lists such as:

['as.web.product.viewed','as.web.product.viewed']
['2018-01-28T19:00:52.032Z','2018-01-28T19:02:20.132Z']

Solution

  • You could use a regular expression to split the string wherever a delimiter in round brackets occurs:

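    import re

    # Match any delimiter in round brackets, e.g. (AT), (THEN), (WITH).
    pat = re.compile(r'\([A-Za-z]+\)')
    s = "as.web.product.viewed(AT)2018-01-28T19:00:52.032Z(THEN)as.web.product.viewed(AT)2018-01-28T19:02:20.132Z"
    r = pat.split(s)  # alternates: action, timestamp, action, timestamp, ...
    print(list(zip(r[::2], r[1::2])))  # group into (action, timestamp) pairs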
    

    This returns:

    [('as.web.product.viewed', '2018-01-28T19:00:52.032Z'), ('as.web.product.viewed', '2018-01-28T19:02:20.132Z')]
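The question asks for two separate lists rather than pairs, and mentions that the delimiter between actions can vary. A minimal sketch of one way to get that is below. It assumes every action is immediately followed by "(AT)" and its timestamp, as in the example string; the helper name `parse_sequence` and the extra `links` list are illustrative additions, not part of the original answer. Putting a capturing group in the pattern makes `re.split` keep the delimiter names in its result.

```python
import re

# Capturing group: re.split keeps the delimiter name (AT, THEN, WITH, ...)
# in the result instead of discarding it.
DELIM = re.compile(r'\(([A-Za-z]+)\)')

def parse_sequence(s):
    """Hypothetical helper: return (actions, timestamps, links) in order.

    Assumes the layout action(AT)timestamp(LINK)action(AT)timestamp...
    """
    parts = DELIM.split(s)
    # parts repeats in groups of four: action, 'AT', timestamp, link
    actions = parts[0::4]
    timestamps = parts[2::4]
    links = parts[3::4]  # delimiters between actions, e.g. 'THEN' or 'WITH'
    return actions, timestamps, links

s = ("as.web.product.viewed(AT)2018-01-28T19:00:52.032Z"
     "(THEN)as.web.product.viewed(AT)2018-01-28T19:02:20.132Z")
actions, timestamps, links = parse_sequence(s)
print(actions)     # ['as.web.product.viewed', 'as.web.product.viewed']
print(timestamps)  # ['2018-01-28T19:00:52.032Z', '2018-01-28T19:02:20.132Z']
print(links)       # ['THEN']
```

If the strings in the database can deviate from that layout (for example an action with no timestamp), the fixed stride of four no longer lines up, so it would be worth validating `parts` before slicing.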