python regex pandas substring data-cleaning

Trying to get a substring using regex in Python / pandas

I know this may seem stupid but I've been looking everywhere and trying with regex and split in vain. My script never works for all type of string I have on my data set.

I have this column that contains raw data that look like (three cases):

20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel
20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX
20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah

So what I want to do is to get the strings in the middle after CC and right before "-". I wrote a script that did work for the 2nd case but never the other two :

1st case: Auchy-Les-Mines
2nd case: LeMarine
3rd case: Bonjour le monde

Here is the regex that I used but never works for all cases: regex = r"\s\b.*-."

Thanks in advance !

Solution

You my use

df['Col'].str.extract(r'-CC\s+(.*?)\s+-')

If there can be line breaks between the two delimiters, add the s/dotall flag or use [\w\W]/[\s\S]/[\d\D] instead of a .:

df['Col'].str.extract(r'(?s)-CC\s+(.*?)\s+-')
#                       ^^^^
df['Col'].str.extract(r'-CC\s+([\w\W]*?)\s+-')
#                              ^^^^^^

See the regex demo.

Pattern details

-CC - a literal substring
\s+ - 1+ whitespaces
(.*?) - Group 1 (this value will be returned by .str.extract): any 0+ chars other than newline, as few as possible
\s+ - 1+ whitespaces (+ is important here)
- - a hyphen

The fact that there are \s+ patterns on both ends of (.*?) will make sure the result is already stripped from whitespace regardless of how many whitespaces there were.