Tags: python, regex, pypdf

Regular expression to match a string only if its letters are all capitals (Python)


I'm extracting the paragraphs that follow headings such as "OBSERVATION #1" or "OBSERVATION #2" from text produced by a library like PyPDF2.
However, the extraction sometimes introduces errors, so a heading can come out as "OBSERVA'TION #2". I also have to avoid matching text like "Suite #300", so the rule is: "if the heading contains letters, they must all be capitals".
My current Python code snippet is:

    inspection_observation = pdfFile.getPage(z).extractText()
    if 'OBSERVATION' in inspection_observation:
        for finding in re.findall(r"[OBSERVATION] #\d+(.*?) OBSERVA'TION #\d?", inspection_observation, re.DOTALL):
            # print inspection_observation
            print(finding)

Please advise on the appropriate regular expression for this case.


Solution

  • If the word must contain at least one capital letter and may also contain a ', you can use a character class that lists the allowed characters, combined with a positive lookahead that asserts at least one capital is present.

    You can then capture the content that follows such a heading, and use a second positive lookahead to stop where another capitalized word followed by # and 1+ digits begins, or at the end of the string. This regex relies on re.DOTALL, so that the dot also matches a newline.

    (?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))
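
As a minimal sketch of how this pattern might be applied in Python 3, the snippet below compiles it with re.DOTALL and collects the captured text after each heading. The sample string and the names PATTERN, sample_text and findings are assumptions made for illustration; in the question's code the text would come from extractText().

```python
import re

# Pattern from the answer above; re.DOTALL lets "." also match newlines.
PATTERN = re.compile(
    r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+(.*?(?=[A-Z']*[A-Z][A-Z']*\s+#\d+|$))",
    re.DOTALL,
)

# Hypothetical sample standing in for the text extracted from a PDF page.
sample_text = (
    "OBSERVATION #1 First finding, see Suite #300 for details. "
    "OBSERVA'TION #2 Second finding."
)

# Group 1 holds the text between one heading and the next (or the end of the string).
findings = [match.group(1).strip() for match in PATTERN.finditer(sample_text)]
print(findings)
```

Here "Suite #300" stays inside the first captured paragraph instead of being treated as a heading, because "Suite" contains lowercase letters.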
    

    Explanation

    • (?=[A-Z']*[A-Z]) Positive lookahead to assert that what follows contains at least one char A-Z, where A-Z or ' can occur before it
    • [A-Z']+\s+#\d+ Match 1+ times A-Z or ', 1+ whitespace chars, # and 1+ digits
    • ( Capture group
      • .*? Match any char 0+ times, as few as possible (non-greedy)
      • (?= Positive lookahead to assert what follows is
        • [A-Z']*[A-Z][A-Z']* Match an uppercase char A-Z where A-Z or ' can occur before and after
        • \s+#\d+ Match 1+ whitespace chars, # and 1+ digits
        • |$ Or assert the end of the string
      • ) Close lookahead
    • ) Close capture group
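
The heading part of the pattern can also be checked on its own. The sketch below is an illustration only: the name HEADER and the candidate strings are assumptions, and fullmatch is used so that each candidate is tested as a whole.

```python
import re

# Heading part only: capitals and apostrophes, with at least one capital required
# by the lookahead, then whitespace, "#" and digits.
HEADER = re.compile(r"(?=[A-Z']*[A-Z])[A-Z']+\s+#\d+")

# Hypothetical candidates for illustration.
candidates = ["OBSERVATION #1", "OBSERVA'TION #2", "Suite #300", "''' #4"]

# fullmatch requires the whole candidate string to match the heading pattern.
results = {text: bool(HEADER.fullmatch(text)) for text in candidates}
print(results)
```

The two OBSERVATION variants are accepted, "Suite #300" is rejected because of its lowercase letters, and "''' #4" is rejected because the lookahead finds no capital letter.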
