I am looking to extract a list of tuples from the following string:
text='''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''
I am using the re module for this.
The output should be something like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]
I want a re.findall call that produces the above output. So far I have this:
re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)
Here I am trying to capture the characters before ':', then the characters before '%', and then the characters after 'in'.
I'm not sure how to continue. Any help would be appreciated. Thanks!
You can use
re.findall(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)', text)
# => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]
See the regex demo and the Python demo.
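As a side note on the attempt in the question: in Python's re module, \A and \Z are zero-width anchors for the very start and very end of the string, not "characters before" or "characters after" something. The short sketch below illustrates this on a two-line sample (the variable names are only for illustration):

```python
import re

sample = "Consumer Price Index:\n+0.2% in Sep 2020"

# \Z only matches at the very end of the string, so ":\Z" requires the
# colon to be the last character. Here it is followed by more text.
colon_at_end = re.search(r":\Z", sample)

# \A only matches at the very start of the string, so "\Ain " requires
# the string to begin with "in ".
in_at_start = re.search(r"\Ain ", sample)

# Without the anchors, the same literals are found inside the string.
colon_newline = re.search(r":\n", sample)
in_between = re.search(r" in ", sample)

print(colon_at_end)               # None
print(in_at_start)                # None
print(colon_newline is not None)  # True
print(in_between is not None)     # True
```

This is why the pattern has to describe the text around ':' and '%' with capturing groups instead of anchors.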
Details
(\S.*) - Group 1: a non-whitespace char followed by any zero or more chars other than line break chars, as many as possible
: - a colon
\n - a newline
\s* - zero or more whitespaces
(\+?\d[\d.]*%) - Group 2: an optional +, a digit, zero or more digits/dots, and a %
\s+in\s+ - in enclosed with 1+ whitespaces
(.*) - Group 3: any zero or more chars other than line break chars, as many as possible
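The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a minimal, self-contained sketch of the same regex. The names pattern and matches are only illustrative.

```python
import re

text = '''Consumer Price Index:
+0.2% in Sep 2020
Unemployment Rate:
+7.9% in Sep 2020
Producer Price Index:
+0.4% in Sep 2020
Employment Cost Index:
+0.5% in 2nd Qtr of 2020
Productivity:
+10.1% in 2nd Qtr of 2020
Import Price Index:
+0.3% in Sep 2020
Export Price Index:
+0.6% in Sep 2020'''

# Same pattern as above, split into commented parts.
pattern = re.compile(r"""
    (\S.*):          # Group 1: the label, up to the colon that ends the line
    \n               # the line break after the colon
    \s*              # optional leading whitespace on the value line
    (\+?\d[\d.]*%)   # Group 2: optional plus sign, digits and dots, percent sign
    \s+in\s+         # the word "in" surrounded by whitespace
    (.*)             # Group 3: the rest of the line (the period)
""", re.VERBOSE)

matches = pattern.findall(text)
for label, change, period in matches:
    print(label, change, period)
```

Because the pattern has three capturing groups, findall returns one 3-tuple per match, which is the list-of-tuples shape asked for in the question.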