python regex data-extraction

Regex for extracting dates and BP values simultaneously


I have unstructured data from which I need to extract BP values and dates (which appear in different formats), as shown below. Right now I have a regex that extracts the BP values. In the specific case highlighted in the picture, consecutive dates, and even dates that occur on their own, have to be extracted too (but not the DOB).

[image: sample of the data]

Currently, my code returns only the BP values. I want a regex that extracts the BP values and the dates at the same time.

I have attached the regex code below.

regex = r'\b(?:BP:?(?:-Sitting)?|Blood Pressure) \d+/\d+(?: \d+/\d+|  \d+/\d+)*(?: sm| -Lw| cB| Jr|\
    -aA| cs| -ic| ic| -RG|  kA| -sL| BL| kc| am| -sH| sH| es| ts| np| 8s| ca| Pm| JE| so| cp| v8| Eu| -cp|\
    Pm| EB| Fr| -Fr| -ms| -LN| -mT| -mk| -GF| -HO| Jp| wD| 8m| mc| -mc| Yr| -Lp| -ml| -LA| s/d| -aA| s/d|mmHg| mm Hg|\
    mm hg.|.?)?|B/P - (?:Sys|Dias)tolic \d+|(?:Sys|Dias)tolic Blood Pressure \d+ \w+\b'

The current output is shown in the image below; the dates are not included.

[image: current output]

Any help with this would be greatly appreciated.


Solution

  • One option is to also match an optional / followed by 1 or more digits right after each place where you match \d+/\d+.

    You can shorten the part \d+/\d+(?: \d+/\d+|  \d+/\d+)* to \d+/\d+(?:  ?\d+/\d+)*, as the only difference between the two alternatives is matching 1 or 2 spaces.

    Adding the optional forward slash and 1 or more digits to both the first part and the repetition would look like \d+/\d+(?:/\d+)?(?:  ?\d+/\d+(?:/\d+)?)*

    The updated pattern:

    \b(?:BP:?(?:-Sitting)?|Blood Pressure) \d+/\d+(?:/\d+)?(?:  ?\d+/\d+(?:/\d+)?)*(?: sm| -Lw| cB| Jr|\
        -aA| cs| -ic| ic| -RG|  kA| -sL| BL| kc| am| -sH| sH| es| ts| np| 8s| ca| Pm| JE| so| cp| v8| Eu| -cp|\
        Pm| EB| Fr| -Fr| -ms| -LN| -mT| -mk| -GF| -HO| Jp| wD| 8m| mc| -mc| Yr| -Lp| -ml| -LA| s/d| -aA| s/d|mmHg| mm Hg|\
        mm hg.)?|B/P - (?:Sys|Dias)tolic \d+|(?:Sys|Dias)tolic Blood Pressure \d+ \w+\b
    


    Note that I have omitted the .? at the end of the alternation, as it would also match a trailing whitespace character.
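
To see the idea in action, here is a minimal Python sketch. It is not the full pattern: the long list of two-letter suffixes is left out and only two unit suffixes are kept. The names BP_PATTERN and sample, and the input string itself, are made up for illustration, since the real data from the question is not available.

```python
import re

# Trimmed sketch of the updated pattern (suffix list from the question omitted).
# Adjacent raw-string literals are used to split the pattern across lines.
BP_PATTERN = re.compile(
    r'\b(?:BP:?(?:-Sitting)?|Blood Pressure) '
    r'\d+/\d+(?:/\d+)?'            # a reading like 120/80 or a date like 11/12/2019
    r'(?:  ?\d+/\d+(?:/\d+)?)*'    # further readings or dates, 1 or 2 spaces apart
    r'(?: mmHg| mm Hg)?'           # optional unit
)

# Hypothetical input, assumed for this example only.
sample = "BP 120/80 11/12/2019 118/78 mmHg and Blood Pressure 130/85"
matches = BP_PATTERN.findall(sample)
print(matches)
# ['BP 120/80 11/12/2019 118/78 mmHg', 'Blood Pressure 130/85']

# Effect of the trailing .? that was dropped from the alternation:
loose = re.search(r'BP \d+/\d+(?: mmHg|.?)?', "BP 120/80 taken").group()
strict = re.search(r'BP \d+/\d+(?: mmHg)?', "BP 120/80 taken").group()
print(repr(loose), repr(strict))
# 'BP 120/80 ' 'BP 120/80'
```

The sketch joins adjacent raw-string literals rather than ending each line with a backslash. Inside a Python raw string, a backslash followed by a newline is kept as those two characters, so the newline and the indentation of the next line would become part of the pattern.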