regex python-re

Regex - Match between two strings with an exclusion to avoid overlapping


I am extracting tables from some PDFs using Python. Specifically, I am pulling out tables whose matches have the potential to overlap.

For a while, the format was the following:

TABLE A:

(stuff that ends with a %)

TABLE B:

(stuff that ends with a %)

etc, etc

I used this regex to get each table separately, without overlapping (overlapping here means grabbing everything between the first TABLE and the last %):

(TABLE [A-Z]:)(([^%]|\n)*)%
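
To illustrate that behaviour, here is a minimal sketch using the standard `re` module. The sample text is made up and only stands in for the real PDF output, and the variable names are my own.

```python
import re

# Made-up sample standing in for text extracted from a PDF (assumption).
text = """TABLE A:
row one 10%
TABLE B:
row two 20%
"""

# Old pattern: the negated set [^%] stops each match at the first '%'.
old_pattern = re.compile(r"(TABLE [A-Z]:)(([^%]|\n)*)%")

# Keep only the first two groups: the table header and the table body.
tables = [(m.group(1), m.group(2)) for m in old_pattern.finditer(text)]
print(tables)
```

Each table comes back as its own match, because the body can never run past a `%`.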

Recently, the format changed and each table now ends with a distinct word (Carriage). When I add this word to my old pattern, it no longer works properly, because I was using a negated set and placing the whole word inside it negates the individual letters instead. I do not know how to negate the whole string, and I have not been able to integrate any solution I have found into the rest of the pattern.

P.S. I am aware that the third-party regex module has a findall that permits overlapping matches, but I am presently restricted to the Python standard library in my org.


Solution

  • In your pattern, you could write this part ([^%]|\n)* as ([^%]*), because the negated character class already matches a newline.

    But a negated character class only excludes single characters, so it cannot be used to exclude a whole word such as Carriage.

    What you could do instead is make the dot match a newline and match as few characters as possible until you encounter Carriage:

    (?s)(TABLE [A-Z]:)(.*?)\bCarriage\b
    

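
As a rough sketch of how the lazy pattern could be used with only the standard `re` module, the snippet below runs it on made-up sample text (an assumption, not the real PDF output). It also shows a second variant that is not part of the pattern above: a "tempered" dot, `(?:(?!\bCarriage\b).)*`, which is one way to negate a whole word rather than a single character.

```python
import re

# Made-up sample in the new format, where each table ends with "Carriage" (assumption).
text = """TABLE A:
alpha 10% beta
Carriage
TABLE B:
gamma 20%
Carriage
"""

# (?s) lets '.' match newlines; '.*?' is lazy, so each match stops at the first "Carriage".
lazy = re.compile(r"(?s)(TABLE [A-Z]:)(.*?)\bCarriage\b")
lazy_tables = lazy.findall(text)

# Alternative sketch: consume one character at a time, but only while the
# word "Carriage" does not start at the current position.
tempered = re.compile(r"(?s)(TABLE [A-Z]:)((?:(?!\bCarriage\b).)*)\bCarriage\b")
tempered_tables = tempered.findall(text)

print(lazy_tables)
print(lazy_tables == tempered_tables)
```

On this sample both patterns return the same (header, body) pairs, and a `%` inside a table body no longer ends the match. The lazy version is shorter; the tempered version may be preferable if you later need the "not Carriage" condition in the middle of a larger pattern.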