Tags: python-3.x, regex, string, split, python-re

Split string elements based on multiple words delimiters in Python


Given a text list as follows:

text = ["Nanjing Office and Retail Market Overview 2019 Second Quarter", 
"Xi'an Office and Retail Market Overview 2020 Q1", 
"Suzhou office and retail overview 2019 fourth quarter DTZ Research", 
"marketbeat Shanghai office Second quarter of 2020 Future New Grade A office supply in non-core business districts One-year trend Although the epidemic in Shanghai has been controlled in a timely and effective manner, the negative impact of the epidemic on Shanghai's commercial real estate The impact continues.", 
"Shanghai office September 2019 marketbeats 302.7 -0.4% 12.9% rent rent growth vacancy"]

I would like to split each string at the first occurrence of any of several delimiter words (Market, quarter, marketbeats) and keep the first part, including the delimiter word:

import re

for string in text:
    string = string.lower()  # lowercased, which is why the output below is lowercase
    split_str = re.split(r"[Market|quarter|marketbeats]", string)
    print(split_str)

Out:

['n', 'njing offic', ' ', 'nd ', '', '', '', 'il ', '', '', '', '', '', ' ov', '', 'vi', 'w 2019 ', '', 'cond ', '', '', '', '', '', '', '']
["xi'", 'n offic', ' ', 'nd ', '', '', '', 'il ', '', '', '', '', '', ' ov', '', 'vi', 'w 2020 ', '1']
...

But the expected result is:

"Nanjing Office and Retail Market", 
"Xi'an Office and Retail Market", 
"Suzhou office and retail overview 2019 fourth quarter", 
"marketbeat Shanghai office Second quarter", 
"Shanghai office September 2019 marketbeats"

How could I get the correct result in Python? Thanks.


Solution

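  • The square brackets in your pattern define a character class, so re.split breaks the string at every single character in that set rather than at whole words. Put the words in a non-capturing group of alternatives instead. You could use an re.findall approach here: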

    import re

    text = ["Nanjing Office and Retail Market Overview 2019 Second Quarter", "Xi'an Office and Retail Market Overview 2020 Q1", "Suzhou office and retail overview 2019 fourth quarter DTZ Research", "marketbeat Shanghai office Second quarter of 2020 Future New Grade A office supply in non-core business districts One-year trend Although the epidemic in Shanghai has been controlled in a timely and effective manner, the negative impact of the epidemic on Shanghai's commercial real estate The impact continues.", "Shanghai office September 2019 marketbeats 302.7 -0.4% 12.9% rent rent growth vacancy"]
    # Lazily match from the start of each string through the first keyword.
    output = [re.findall(r'^.*?\b(?:Market|quarter|marketbeats|$)\b', x)[0] for x in text]
    print(output)
    

    This prints:

    ['Nanjing Office and Retail Market',
     "Xi'an Office and Retail Market",
     'Suzhou office and retail overview 2019 fourth quarter',
     'marketbeat Shanghai office Second quarter',
     'Shanghai office September 2019 marketbeats']
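
To make the difference between the two patterns concrete, here is a minimal sketch that contrasts a character class with a group of alternatives and wraps the same idea in a small helper. The helper name `head_through_keyword` and the `KEYWORDS` tuple are hypothetical, introduced only for illustration. It uses `re.search` rather than `re.findall(...)[0]` so that a string with no keyword is returned unchanged; the `|$` fallback in the pattern above is followed by `\b`, so it would not match a keyword-free string that ends in punctuation, and indexing `[0]` would then raise an `IndexError`.

```python
import re

# Keywords taken from the question; the names below are hypothetical.
KEYWORDS = ("Market", "quarter", "marketbeats")

# [Market|quarter] is a character class: it matches ONE character from the
# set {M, a, r, k, e, t, |, q, u}, so the split shreds the text.
char_class_parts = re.split(r"[Market|quarter]", "fourth quarter")

# (?:Market|quarter) is a group of alternatives: it matches a whole word.
group_parts = re.split(r"(?:Market|quarter)", "fourth quarter")

# Lazy prefix up to and including the first keyword, on word boundaries.
PATTERN = re.compile(
    r"^.*?\b(?:" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b"
)


def head_through_keyword(s):
    """Return s up to and including the first keyword, or s if none occurs."""
    m = PATTERN.search(s)
    return m.group(0) if m else s


if __name__ == "__main__":
    print(char_class_parts)
    print(group_parts)
    print(head_through_keyword("Xi'an Office and Retail Market Overview 2020 Q1"))
    print(head_through_keyword("No keyword in this sentence."))
```

The match is case-sensitive, as in the answer above, so "Quarter" and "market" are not treated as delimiters; passing `re.IGNORECASE` to `re.compile` would change that, and would also change which word is matched first in some of the sample strings.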