python python-3.x regex text-extraction

How to extract the list of text between the pattern using RegEx?


I have text like:

05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50 

ATVI - 0.00 23.50 (9,425.77)

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16 

AAPL - 0.00 6.16 (9,419.61)

05/28/21 05/28/21 Margin Div/Int - Income STARBUCKS CORP
COM
Payable: 05/28/2021
QUALIFIED DIVIDENDS 18.00 

SBUX - 0.00 18.00 (9,401.61)

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021 

 - - 0.00 (73.03) (9,474.64)

I want to extract individual records, such as:

05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC

COM
Payable: 05/06/2021
QUALIFIED DIVIDENDS 23.50 

ATVI - 0.00 23.50 (9,425.77)

and

05/13/21 05/13/21 Margin Div/Int - Income APPLE INC
COM
Payable: 05/13/2021
QUALIFIED DIVIDENDS 6.16 

AAPL - 0.00 6.16 (9,419.61)

and

05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE
Payable: 05/28/2021 

 - - 0.00 (73.03) (9,474.64)

Each record should start with a date (\d+/\d+/\d) and end where a blank line followed by the next date begins (\n\n\d+/\d+/\d).

I tried re.findall(r'\d+/\d+/\d(.*?)\n\n\d+/\d+/\d+', a), but it doesn't work as expected.


Solution

  • You can match

    .+?(?=\s*(?:\d{2}\/\d{2}\/\d{2} ){2}|$)
    

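    with the 'g' ("global") and 's' ("single line" or "dot-all") flags set. The 's' flag makes a period match any character, including line terminators. In Python, re.findall returns all matches (the role of 'g'), and re.DOTALL (or re.S) is the dot-all flag.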

    The regular expression can be broken down as follows.

    .+?                        # match one or more chars, lazily
    (?=                        # begin a positive lookahead
      \s*                      # match zero or more whitespaces
      (?:                      # begin a non-capture group 
        \d{2}\/\d{2}\/\d{2}[ ] # match a date string followed by a space
      ){2}                     # end the non-capture group and match it twice
    |                          # or
      $                        # match the end of the string
    )                          # end positive lookahead
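
A minimal Python sketch of the approach above, assuming the text is held in a single string. The names RECORD_RE, split_records and sample are illustrative, not from the original answer. Because the lookahead stops each match before the whitespace that separates records, the pattern as written also appears to yield whitespace-only matches between records, so the sketch strips each match and drops the empty ones.

```python
import re

# Pattern from the answer. re.DOTALL plays the role of the 's' flag, and
# re.findall returns every match, which is what the 'g' flag does elsewhere.
# The slashes need no escaping in Python, so "\/" is written as "/".
RECORD_RE = re.compile(r".+?(?=\s*(?:\d{2}/\d{2}/\d{2} ){2}|$)", re.DOTALL)


def split_records(text):
    """Return the records in text, dropping whitespace-only matches."""
    return [m.strip() for m in RECORD_RE.findall(text) if m.strip()]


# Three of the records from the question, used as sample input.
sample = (
    "05/06/21 05/06/21 Margin Div/Int - Income ACTIVISION BLIZZARD INC\n"
    "\n"
    "COM\n"
    "Payable: 05/06/2021\n"
    "QUALIFIED DIVIDENDS 23.50 \n"
    "\n"
    "ATVI - 0.00 23.50 (9,425.77)\n"
    "\n"
    "05/13/21 05/13/21 Margin Div/Int - Income APPLE INC\n"
    "COM\n"
    "Payable: 05/13/2021\n"
    "QUALIFIED DIVIDENDS 6.16 \n"
    "\n"
    "AAPL - 0.00 6.16 (9,419.61)\n"
    "\n"
    "05/28/21 05/28/21 Margin Div/Int - Expense MARGIN INTEREST CHARGE\n"
    "Payable: 05/28/2021 \n"
    "\n"
    " - - 0.00 (73.03) (9,474.64)\n"
)

records = split_records(sample)
for record in records:
    print(record)
    print("-----")
```

The "Payable: 05/06/2021" lines do not end a record early, because the lookahead requires two two-digit-year dates in a row, each followed by a space.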