I am extracting tables from some PDFs using Python. Specifically, I am pulling out tables, and the matches have the potential to overlap.
For a while, the format was the following:
TABLE A:
(stuff that ends with a %)
TABLE B:
(stuff that ends with a %)
etc.
I would use this regex to get each of the tables without overlapping (i.e., without grabbing everything between the first TABLE and the last %):
(TABLE [A-Z]:)(([^%]|\n)*)%
Recently, the format has changed and now each table ends with a distinct word (Carriage). When I try to add this to my old pattern, it no longer works properly because I was using a negated set, and placing the whole word in the set negates its individual letters instead. I do not know how to negate the whole string, and I have not been able to integrate any of the solutions I have found into the remainder of the pattern.
P.S. I am aware that the third-party regex module has a findall that permits overlapping matches, but I am presently restricted to the Python standard library in my org.
In your pattern, you could write this part ([^%]|\n)*
as ([^%]*)
because the negated character class already matches a newline.
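A quick check of that claim, as a minimal sketch: the sample text below is invented for illustration (it is not taken from your PDFs), and it only shows that both forms of the pattern capture the same table bodies with the standard re module.

```python
import re

# Hypothetical sample in the old format: each table ends with a %.
text = "TABLE A:\nrow 1\nrow 2 10%\nTABLE B:\nrow 3 20%\n"

# Original pattern with the redundant |\n alternative.
old = re.compile(r"(TABLE [A-Z]:)(([^%]|\n)*)%")
# Simplified pattern: [^%] already matches newlines.
new = re.compile(r"(TABLE [A-Z]:)([^%]*)%")

# Group 2 is the table body in both patterns.
old_bodies = [m.group(2) for m in old.finditer(text)]
new_bodies = [m.group(2) for m in new.finditer(text)]

print(old_bodies == new_bodies)
print(new_bodies)
```

The simplified form also drops the inner capture group, so group numbering stays simpler if you extend the pattern later.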
But that approach does not work if you want to stop at a word instead of a single character, because a negated character class excludes individual characters, not a sequence of them.
What you could do instead is make the dot match a newline and match as few characters as possible until you encounter Carriage:
(?s)(TABLE [A-Z]:)(.*?)\bCarriage\b
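Here is a small sketch of how that pattern could be used with only the standard library. The sample text is made up to resemble the new format you describe (tables that end with the word Carriage), so adjust it to your real data.

```python
import re

# Hypothetical sample in the new format: a % can now appear inside a table,
# and each table ends with the word Carriage.
text = (
    "TABLE A:\n"
    "width 10%\n"
    "height 20%\n"
    "Carriage\n"
    "TABLE B:\n"
    "depth 30%\n"
    "Carriage\n"
)

# (?s) makes the dot match newlines; .*? is lazy, so each match stops at the
# first Carriage after its TABLE header instead of running to the last one.
pattern = re.compile(r"(?s)(TABLE [A-Z]:)(.*?)\bCarriage\b")

# Pair each header with its body, trimming surrounding whitespace.
tables = [(m.group(1), m.group(2).strip()) for m in pattern.finditer(text)]

for header, body in tables:
    print(header)
    print(body)
```

Passing re.DOTALL as a flag to re.compile is equivalent to the inline (?s). The \b word boundaries mean a longer word such as Carriages will not end a table, which may or may not be what you want.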