python regex text-mining

Extracting a List from Text using Regular Expression in Python


I am looking to extract a list of tuples from the following string:

text='''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020

        Producer Price Index:
        +0.4% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020

        Productivity:
        +10.1% in 2nd Qtr of 2020

        Import Price Index:
        +0.3% in Sep 2020

        Export Price Index:
        +0.6% in Sep 2020'''

I am using Python's re module (import re) for this.

The output should be something like: [('Consumer Price Index', '+0.2%', 'Sep 2020'), ...]

I want to use re.findall to produce the above output. So far I have this:

re.findall(r"(:\Z)\s+(%\Z+)(\Ain )", text)

Here I am trying to capture the characters before ':', then the characters before '%', and then the characters after 'in'.

I'm really just clueless on how to continue. Any help would be appreciated. Thanks!


Solution

  • You can use

    re.findall(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)', text)
    # => [('Consumer Price Index', '+0.2%', 'Sep 2020'), ('Unemployment Rate', '+7.9%', 'Sep 2020'), ('Producer Price Index', '+0.4%', 'Sep 2020'), ('Employment Cost Index', '+0.5%', '2nd Qtr of 2020'), ('Productivity', '+10.1%', '2nd Qtr of 2020'), ('Import Price Index', '+0.3%', 'Sep 2020'), ('Export Price Index', '+0.6%', 'Sep 2020')]
    

    See the regex demo and the Python demo.
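As a quick check, the pattern above can be run end to end. This is a minimal sketch that assumes a shortened copy of the sample text from the question; the names pattern and rows are only illustrative.

```python
import re

# Shortened copy of the sample text from the question (three of the seven entries).
text = '''Consumer Price Index:
        +0.2% in Sep 2020

        Unemployment Rate:
        +7.9% in Sep 2020

        Employment Cost Index:
        +0.5% in 2nd Qtr of 2020'''

# Same pattern as in the answer, compiled once so it can be reused.
pattern = re.compile(r'(\S.*):\n\s*(\+?\d[\d.]*%)\s+in\s+(.*)')

# With three capturing groups, findall returns a list of 3-tuples.
rows = pattern.findall(text)

for name, change, period in rows:
    print(f'{name}: {change} ({period})')
```

Because the pattern contains three capturing groups, re.findall returns one tuple per match rather than the whole matched string.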

    Details

    • (\S.*) - Group 1: a non-whitespace char followed by zero or more chars other than line break chars, as many as possible (the engine then backtracks so that the colon that follows can match)
    • : - a colon
    • \n - a newline
    • \s* - zero or more whitespace chars
    • (\+?\d[\d.]*%) - Group 2: an optional +, a digit, zero or more digits or dots, and a %
    • \s+in\s+ - the word in, with one or more whitespace chars on each side
    • (.*) - Group 3: zero or more chars other than line break chars, as many as possible (the rest of the line)
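The breakdown above can also be written into the pattern itself with re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. The following is only a sketch of one possible variant: the group names (name, change, period), INDICATOR_RE, and the helper parse_indicators are hypothetical and not part of the original answer.

```python
import re

# Same pattern as in the answer, spread over several lines with named groups.
# The group names and the identifiers below are illustrative assumptions.
INDICATOR_RE = re.compile(r'''
    (?P<name>\S.*)              # label: starts at a non-whitespace char
    :                           # colon that ends the label
    \n                          # newline after the label
    \s*                         # indentation before the value
    (?P<change>\+?\d[\d.]*%)    # optional plus sign, digits and dots, percent sign
    \s+in\s+                    # the word "in" between whitespace
    (?P<period>.*)              # rest of the line
''', re.VERBOSE)


def parse_indicators(text):
    """Return one dict per indicator found in text (hypothetical helper)."""
    return [m.groupdict() for m in INDICATOR_RE.finditer(text)]


sample = (
    "Productivity:\n"
    "        +10.1% in 2nd Qtr of 2020\n"
    "\n"
    "        Import Price Index:\n"
    "        +0.3% in Sep 2020"
)

records = parse_indicators(sample)
for record in records:
    print(record)
```

Named groups trade the tuple output requested in the question for dictionaries; if tuples are needed, m.groups() can be used in place of m.groupdict().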