Search code examples
pythonregexbacktracking

regex catastrophic backtracking ; extracting words starts with capital before the specific word


I'm relatively new to Python world and having trouble with regex.

I'm trying to extract Firm's name before the word 'sale(s)' (or Sale(s)).

I found that Firm's names in my text data are all start with capital letter(and the other parts can be lowercase or uppercase or numbers or '-' or ', for example 'Abc Def' or 'ABC DEF' or just 'ABC' or 'Abc'),

and some of them are taking forms like ('Abc and Def' or 'Abc & Def').

For example,

from the text,

;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.

I only want to extract 'Computer's Parts and Display Segment'.

So I tried to create a regex

((?:(?:[A-Z]+[a-zA-Z\-0-9\']*\.?\s?(?:and |\& )?)+)+?(?:[S|s]ales?\s))

( 1.[A-Z]+[a-zA-Z-0-9\']*.?\s => this part is to find words start with capital letter and other parts are composed of a-z or A-Z or 0-9 or - or ' or . .

  1. (?:and |\& )? => this part is to match word with and or & )

However, at https://regex101.com/ it calls out catastrophic backtracking and I read some related articles, but still cannot find way to solve this problem.

Could you help me?

Thanks!


Solution

  • Overview

    Pointing out a few things in your pattern:

    • [a-zA-Z\-0-9\'] You don't need to escape ' here. Also, you can just place - at the start or end of the set and you won't need to escape it.
    • \& The ampersand character doesn't need to be escaped.
    • [S|s] Says to match either S, |, or s, thus you could potentially match |ales. The correct way to write this is [Ss].

    Code

    See regex in use here

    (?:(?:[A-Z][\w'-]*|and) +)+(?=[sS]ales?)
    

    Results

    Input

    ;;;;;PRINCIPAL CUSTOMERS In fiscal 2005, the Company derived approximately 21% ($4,782,852) of its consolidated revenues from continuing operations from direct transactions with Kmart Corporation. Sales of Computer products was good. However, Computer's Parts and Display Segment sale has been decreasing.

    Output

    Computer's Parts and Display Segment 
    

    Explanation

    • (?:(?:[A-Z][\w'-]*|and) +)+ Match this one or more times
      • (?:[A-Z][\w'-]*|and) Match either of the following
        • [A-Z][\w'-]* Match any uppercase ASCII character, followed by any number of word characters, apostrophes ' or hyphens -
        • and Match this literally
      • + Match one or more spaces
    • (?=[sS]ales?) Positive lookahead ensuring any of the words sale, Sale, sales, or Sales follows