Search code examples
pythonregexpandassubstringdata-cleaning

Trying to get a substring using regex in Python / pandas


I know this may seem stupid but I've been looking everywhere and trying with regex and split in vain. My script never works for all type of string I have on my data set.

I have this column that contains raw data that look like (three cases):

20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel
20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX
20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah

So what I want to do is to get the strings in the middle after CC and right before "-". I wrote a script that did work for the 2nd case but never the other two :

1st case: Auchy-Les-Mines
2nd case: LeMarine
3rd case: Bonjour le monde

Here is the regex that I used but never works for all cases: regex = r"\s\b.*-."

Thanks in advance !


Solution

  • You my use

    df['Col'].str.extract(r'-CC\s+(.*?)\s+-')
    

    If there can be line breaks between the two delimiters, add the s/dotall flag or use [\w\W]/[\s\S]/[\d\D] instead of a .:

    df['Col'].str.extract(r'(?s)-CC\s+(.*?)\s+-')
    #                       ^^^^
    df['Col'].str.extract(r'-CC\s+([\w\W]*?)\s+-')
    #                              ^^^^^^
    

    See the regex demo.

    Pattern details

    • -CC - a literal substring
    • \s+ - 1+ whitespaces
    • (.*?) - Group 1 (this value will be returned by .str.extract): any 0+ chars other than newline, as few as possible
    • \s+ - 1+ whitespaces (+ is important here)
    • - - a hyphen

    The fact that there are \s+ patterns on both ends of (.*?) will make sure the result is already stripped from whitespace regardless of how many whitespaces there were.