python, regex, case-insensitive

Python Regex to Capture Proceeding Text - mixing case insensitivity in group


Example Link

RegEx Group returning issue:

(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?

Objective: To extract text from proceeding transcript pdfs for each question/answer/speaker type.

Using Python: iterate through the pages of the PDF's extracted text and group the Question/Answer text.

Desired Results = qa_type, page_start, page_end, line_num_start, line_num_end, qa_text

ISSUE: For the Q and A designators I only want upper case, but the speaker titles (Mr., Mrs., Dr., etc.) need to be matched case-insensitively, and both the Q/A designators and the speaker salutations must end up in a single 'qa_type' group.

Request: How do I prevent 'qa_type' from capturing a lowercase 'a' or 'q'? See lines 2 and 17 on p. 275.

Example bad extract - line 17 'a'

regex = r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?(?P<type_text>\b.*)|page (?P<page_num>\d{1,3})"
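A minimal sketch that reproduces the problem. It assumes the pattern is compiled with re.IGNORECASE and re.MULTILINE (the flags actually used are not shown above), and the transcript line is made up for illustration:

```python
import re

# Pattern copied from above. The global IGNORECASE flag is an assumption;
# it is what lets the bare "A" alternative match a lowercase "a".
regex = r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|Mr[\.|:]? [a-z]+|Mrs[\.|:]? [a-z]+|Ms[\.|:]? [a-z]+|Miss[\.|:]? [a-z]+|Dr[\.|:]? [a-z]+))?([\.|:|\s]+)?(?P<type_text>\b.*)|page (?P<page_num>\d{1,3})"
pattern = re.compile(regex, re.IGNORECASE | re.MULTILINE)

# Hypothetical transcript line 17 whose text starts with a lowercase "a".
m = pattern.search("17 a lot of people were there.")

bad_qa_type = m.group("qa_type")      # "a" is wrongly treated as an Answer marker
bad_type_text = m.group("type_text")  # the leading "a" is missing from the text
print(bad_qa_type, "|", bad_type_text)
```

Dropping re.IGNORECASE avoids the false 'a', but then [a-z]+ no longer matches a capitalised surname such as "Smith", which is why the flag cannot simply be removed.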

Solution

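  • This sounds pretty similar to this question. Python's re module does not let you switch a flag on and off mid-pattern with (?i) ... (?-i): a global inline flag anywhere but the start of the pattern has been deprecated since Python 3.6 and is an error from 3.11, and a bare (?-i) is not accepted at all. What Python 3.6+ does support is a scoped inline flag group, (?i:...), which applies case-insensitivity only inside that group. With it, your regex would look like this (without the global case-insensitive flag):

    (^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|(?i:Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+)))?([.|:|\s]+)?(?P<type_text>\b.*)|(?i:page) (?P<page_num>\d{1,3})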
    

    The alternative is to just specify both the lowercase and uppercase characters every time you want a case-insensitive letter (again, without the global case-insensitive flag):

    (^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)(?P<qa_type>(Q|A|[mM][rR][.|:]? [a-zA-Z]+|[mM][rR][sS][.|:]? [a-zA-Z]+|[mM][sS][.|:]? [a-zA-Z]+|[mM][iI][sS][sS][.|:]? [a-zA-Z]+|[dD][rR][.|:]? [a-zA-Z]+))?([.|:|\s]+)?(?P<type_text>\b.*)|[pP][aA][gG][eE] (?P<page_num>\d{1,3})
    

    Updated regex101 link
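For reference, Python 3.6+ accepts a scoped inline flag group, (?i:...), which turns on case-insensitivity only inside the group. The sketch below uses it so that Q and A stay upper-case-only while the speaker titles and the word "page" match in any case. The transcript excerpt and the names PATTERN, sample and rows are made up for illustration; only re.MULTILINE is passed, with no global case-insensitive flag:

```python
import re

# Q and A are case-sensitive; the titles and "page" sit inside (?i:...)
# groups, so case-insensitivity applies only there.
PATTERN = re.compile(
    r"(^(?P<line_num>[1-9]|1[0-9]|2[0-2])\b +)"
    r"(?P<qa_type>(Q|A|(?i:Mr[.|:]? [a-z]+|Mrs[.|:]? [a-z]+|Ms[.|:]? [a-z]+"
    r"|Miss[.|:]? [a-z]+|Dr[.|:]? [a-z]+)))?"
    r"([.|:|\s]+)?(?P<type_text>\b.*)"
    r"|(?i:page) (?P<page_num>\d{1,3})",
    re.MULTILINE,
)

# Hypothetical extracted text for one page.
sample = """Page 275
1 Q. Where were you that night?
2 a lot of people were there.
3 MR. SMITH: Objection.
4 A At home."""

rows = []
page = None
for m in PATTERN.finditer(sample):
    if m.group("page_num"):
        # A page marker: remember it for the lines that follow.
        page = int(m.group("page_num"))
    else:
        rows.append(
            (page, int(m.group("line_num")), m.group("qa_type"), m.group("type_text"))
        )

for row in rows:
    print(row)
```

In this sketch, line 2 yields a qa_type of None and keeps its leading "a" in the text, while "MR. SMITH" is still captured as a speaker.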