Search code examples
pythonsubstringextracttext-to-column

Extract multiple strings between brackets or two different delimiters


I'm trying to extract data between these two delimiters [,]
for two cases: Beginning with #CB or CR:

Input data:

Scenario,Order_ID,Review
1,1234,<Order No. 10><Remark>Food was good</Remark><UserID>7890</UserID><Filter>[#CB:Customer happy with service]</Filter><Rating>5</Rating>
2,1235,<Rating>Five</Rating><Order No. 17><UserID>7880</UserID><Filter>[#CB:Customer had a good time]</Filter><Remark>Food was good</Remark><Additional>Service up to par</Additional><Remark>[#CB:Customer will return again]</Remark><End>Thats all</End>
3,1236,<Start>We shall begin</Start><Intro>[CR:Fine Dining Experience]</Intro><Rating>Three</Rating><Order No. 12><Filter>[#CB:Food was well prepared]</Filter><Remark>Food was good</Remark><Additional>Ambiance was lively</Additional><Remark>[CR:Hygine was maintained]</Remark><End>Could be better</End><UserID>7880</UserID>

Data wanted:

Scenario,Order_ID,Review1,Review2,Review3,Review4
1,1234,#CB:Customer happy with service,,,
2,1235,#CB:Customer had a good time,#CB:Customer will return again,,
3,1236,CR:Fine Dining Experience,#CB:Food was well prepared,CR:Hygine was maintained,

Solution

  • Building on the other answer, this is how you could get it in the format you asked for.

    import re
    import pandas as pd
    
    p = re.compile(r"\[(.*?)\]")
    
    s1 = "<Order No. 10>Food was good7890[#CB:Customer happy with service]Five"
    s2 = " Five<Order No. 17>7880[#CB:Customer had a good time]Food was goodService up to par[#CB:Customer will return again]Thats all"
    print(p.findall(s1))
    print(p.findall(s2))
    
    d = pd.DataFrame({'scenario':[1,2], 'order_id':['1234', '1235'], 'reviews':[s1,s2] })
    
    def padList(l, length=4):
      l = l[:length]
      l += [''] * (length - len(l))
      return l
    
    d[['review1','review2','review3','review4']] = d.apply(lambda row: padList(p.findall(row['reviews'])), axis=1, result_type='expand')
    
    d