I am trying to extract data from several PDFs. One data point is a date, and the text that precedes the date varies across the PDFs. I checked that the individual regex statements work on their own, but when I combine them into one statement for use in my for loop, no dates are extracted. Here are the strings that I'm trying to match, along with the individual regex statements that pull the date information after 'DATE OF BIRTHDAY':
DATE OF BIRTHDAY\n01/11/2011
date_of_birthday1 = re.search('(?<=DATE OF BIRTHDAY \\n)(.*)', img).groups()
DATE OF BIRTHDAY\n\n02/14/2015
date_of_birthday2 = re.search('(?<=DATE OF BIRTHDAY \\n\\n)(.*)', img).groups()
DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018
date_of_birthday3 = re.search('(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups()
I'm trying to combine these regex statements into an or statement so that I can use them in a for loop, like this:
date_of_birthdays = re.search('(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(.*)|(?<=DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll)(.*)', img).groups
My expected output is
df['Birthdays'] = date_of_birthdays
which will look like this:
df = pd.DataFrame({"Birthdays": ['01/11/2011', '02/14/2015', '05/07/2018']})
df
However, I am not able to pull any of the date information. Thoughts on what I'm doing wrong here?
This works:
>>> import re
>>> re.findall(
... r"(?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)",
... (
... r'DATE OF BIRTHDAY\n01/11/2011' + "\n"
... r'DATE OF BIRTHDAY\n\n02/14/2015' + "\n"
... r'DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018' + "\n"
... ))
['01/11/2011', '02/14/2015', '05/07/2018']
>>>
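Note that the test strings above are raw strings, so the separators are a literal backslash followed by n, not real newline characters. If the text extracted from the PDFs contains real newlines, a pattern written with \\n will not match them. As a rough sketch of a different technique (matching the shape of the date itself rather than everything that precedes it), something like the following handles both cases. The names DATE_AFTER_LABEL, extract_birthday and pages are made up for illustration, and it assumes each text holds one MM/DD/YYYY date after the label.

```python
import re

# Assumption: each PDF's extracted text has one date in MM/DD/YYYY form
# somewhere after the "DATE OF BIRTHDAY" label.
# re.DOTALL lets the lazy .*? skip over real newlines between label and date.
DATE_AFTER_LABEL = re.compile(r"DATE OF BIRTHDAY.*?(\d{2}/\d{2}/\d{4})", re.DOTALL)


def extract_birthday(text):
    """Return the first date after the label, or None if nothing matches."""
    match = DATE_AFTER_LABEL.search(text)
    return match.group(1) if match else None


# Hypothetical stand-ins for the text of three PDFs (real newlines here).
pages = [
    "DATE OF BIRTHDAY\n01/11/2011",
    "DATE OF BIRTHDAY\n\n02/14/2015",
    "DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
]

birthdays = [extract_birthday(page) for page in pages]
print(birthdays)  # ['01/11/2011', '02/14/2015', '05/07/2018']
```

The resulting list can then be assigned to a DataFrame column, for example df['Birthdays'] = birthdays, provided it has one entry per row.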
Regex expanded
(?: DATE [ ] OF [ ] BIRTHDAY )
(?:
\\ n
(?: \\ n )?
| [ ] GIRL [ ] \\ n \\ ni [ ] : [ ] Pll [ ] i [ ] ii \\ n i [ ] \\ n \\ n Pll [ ]
)?
( .* ) # (1)
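The expanded layout can also be used as-is in Python by compiling with the re.VERBOSE flag, which ignores unescaped whitespace and allows # comments inside the pattern. A minimal sketch, assuming the separators are literal backslash-n as in the test strings above; the names BIRTHDAY_RE and text are only illustrative.

```python
import re

# Same expression as the one-liner, laid out with re.VERBOSE.
# [ ] is a literal space; "\\ n" is a literal backslash followed by n.
BIRTHDAY_RE = re.compile(
    r"""
    (?: DATE [ ] OF [ ] BIRTHDAY )
    (?:
        \\ n
        (?: \\ n )?
      | [ ] GIRL [ ] \\ n \\ n i [ ] : [ ] Pll [ ] i [ ] ii \\ n i [ ] \\ n \\ n Pll [ ]
    )?
    ( .* )        # (1) the date
    """,
    re.VERBOSE,
)

# Raw strings, so each \n below is two characters, not a newline.
text = "\n".join([
    r"DATE OF BIRTHDAY\n01/11/2011",
    r"DATE OF BIRTHDAY\n\n02/14/2015",
    r"DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
])

dates = BIRTHDAY_RE.findall(text)
print(dates)  # ['01/11/2011', '02/14/2015', '05/07/2018']
```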
Just a fair warning: the expressions with the lookbehind assertions
present a problem in these two alternations:
(?<= DATE [ ] OF [ ] BIRTHDAY \\ n \\ n )
( .* ) # (1)
| (?<= DATE [ ] OF [ ] BIRTHDAY \\ n )
( .* ) # (2)
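It's hard to visualize, so I'm just going to come right out and say it:
capture group 1 (the first alternation) will never match!
The reason is that the engine scans left to right, so it reaches the
position right after the first \n before it reaches the position after \n\n.
At that earlier position the shorter lookbehind, the one with a single \n
literal, succeeds, and since the .* can consume the rest of the line,
the second alternation will always match first.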
You can fix that by adding a (?!\\n) after the shorter lookbehind,
which forces it not to match when a second \n follows, like this:
(?<= DATE [ ] OF [ ] BIRTHDAY \\ n \\ n )
( .* ) # (1)
| (?<= DATE [ ] OF [ ] BIRTHDAY \\ n )
(?! \\ n )
( .* ) # (2)
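To see the effect in Python, here is a small sketch comparing the two-alternation pattern with and without the (?!\\n) guard. The variable names are illustrative, and the strings use literal backslash-n separators as in the test strings above.

```python
import re

# Raw strings: each \n here is a backslash followed by n, not a newline.
single = r"DATE OF BIRTHDAY\n01/11/2011"
double = r"DATE OF BIRTHDAY\n\n02/14/2015"

# Without the guard, the single-\n lookbehind succeeds at the earlier position.
unguarded = re.compile(
    r"(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(.*)"
)
# With (?!\\n), the second alternation refuses to start before another \n.
guarded = re.compile(
    r"(?<=DATE OF BIRTHDAY\\n\\n)(.*)|(?<=DATE OF BIRTHDAY\\n)(?!\\n)(.*)"
)

print(unguarded.search(double).groups())  # (None, '\\n02/14/2015')
print(guarded.search(double).groups())    # ('02/14/2015', None)
print(guarded.search(single).groups())    # (None, '01/11/2011')
```

In the unguarded case group 2 swallows the second separator along with the date, which is why group 1 never gets a chance.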
With that out of the way, here are some benchmarks of the
approaches under consideration (which are not really the ideal way to do this):
Regex1: (?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 0.29 s, 294.80 ms, 294801 µs
Matches per sec: 508,817
Regex2: (?:(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ]))(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 2.27 s, 2268.42 ms, 2268417 µs
Matches per sec: 66,125
Regex3: (?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)(.*)|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])(.*)
Options: < none >
Completed iterations: 50 / 50 ( x 1000 )
Matches found per iteration: 3
Elapsed Time: 2.76 s, 2760.81 ms, 2760809 µs
Matches per sec: 54,331
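For anyone who wants to repeat the comparison with Python's own engine, a rough sketch using timeit follows. The figures above came from a different tool, so absolute numbers will not line up; the names text, patterns and dates, and the iteration count, are arbitrary choices for illustration.

```python
import re
import timeit

# Same three test lines as above (literal backslash-n, joined by real newlines).
text = "\n".join([
    r"DATE OF BIRTHDAY\n01/11/2011",
    r"DATE OF BIRTHDAY\n\n02/14/2015",
    r"DATE OF BIRTHDAY GIRL \n\ni : Pll i ii\ni \n\nPll 05/07/2018",
])

patterns = {
    "Regex1": re.compile(
        r"(?:DATE[ ]OF[ ]BIRTHDAY)(?:\\n(?:\\n)?"
        r"|[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])?(.*)"
    ),
    "Regex2": re.compile(
        r"(?:(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)"
        r"|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)"
        r"|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ]))(.*)"
    ),
    "Regex3": re.compile(
        r"(?<=DATE[ ]OF[ ]BIRTHDAY\\n\\n)(.*)"
        r"|(?<=DATE[ ]OF[ ]BIRTHDAY\\n)(?!\\n)(.*)"
        r"|(?<=DATE[ ]OF[ ]BIRTHDAY[ ]GIRL[ ]\\n\\ni[ ]:[ ]Pll[ ]i[ ]ii\\ni[ ]\\n\\nPll[ ])(.*)"
    ),
}


def dates(pattern):
    """Collapse each match to one string, whichever group captured it."""
    return ["".join(m.groups("")) for m in pattern.finditer(text)]


for name, pattern in patterns.items():
    # 10,000 full scans of the three-line text per pattern.
    seconds = timeit.timeit(lambda: pattern.findall(text), number=10_000)
    print(f"{name}: {dates(pattern)} in {seconds:.3f} s")
```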