I know this may seem stupid but I've been looking everywhere and trying with regex and split in vain. My script never works for all type of string I have on my data set.
I have this column that contains raw data that look like (three cases):
20181223-FB-BOOST-AAAA-CC Auchy-Les-Mines - Père Noel
20161224-FB-BOOST-SSSS-CC LeMarine - XXX XXX
20161223-FB-BOOST-XXXX-CC Bonjour le monde - Blah blah
So what I want to do is to get the strings in the middle after CC and right before "-". I wrote a script that did work for the 2nd case but never the other two :
1st case: Auchy-Les-Mines
2nd case: LeMarine
3rd case: Bonjour le monde
Here is the regex that I used but never works for all cases: regex = r"\s\b.*-."
Thanks in advance !
You my use
df['Col'].str.extract(r'-CC\s+(.*?)\s+-')
If there can be line breaks between the two delimiters, add the s
/dotall
flag or use [\w\W]
/[\s\S]
/[\d\D]
instead of a .
:
df['Col'].str.extract(r'(?s)-CC\s+(.*?)\s+-')
# ^^^^
df['Col'].str.extract(r'-CC\s+([\w\W]*?)\s+-')
# ^^^^^^
See the regex demo.
Pattern details
-CC
- a literal substring\s+
- 1+ whitespaces(.*?)
- Group 1 (this value will be returned by .str.extract
): any 0+ chars other than newline, as few as possible\s+
- 1+ whitespaces (+
is important here)-
- a hyphenThe fact that there are \s+
patterns on both ends of (.*?)
will make sure the result is already stripped from whitespace regardless of how many whitespaces there were.