regex regex-group

Trim whitespace and line breaks from a regex capture group


I've got the following HTML code:

                      <span><b id="PHONE">Phone Number:</b> </span>
                    <span
                      style="margin-left: 8px; color: #1c3452; font-family: 'PT Sans', sans-serif; font-size: 16px; font-style: normal; font-weight: 700; line-height: 16px;">+32
                      123 45 67 89</span>
                    <br />

<span><b id="EMAIL">Email Address:</b> </span>
                    <span
                      style="margin-left: 8px; color: #1c3452; font-family: 'PT Sans', sans-serif; font-size: 16px; font-style: normal; font-weight: 700; line-height: 16px;">[email protected]/span>
                    <br />

With an expression I want to capture the values of the Phone Number and Email Address. So far for the phone number I tried this:

<b id="PHONE">(Phone Number)(\s*):(\s*)<\/b>(\s*)<\/span>(\s*)<span[^>]*>(\s*)(?<Phone>[^<]*)<\/span>

It successfully captures the phone number, but it also grabs the line break and the whitespace.

I prepared an example here: https://regex101.com/r/X11veJ/1

Does anyone have an idea how I can get the phone number without the line breaks and whitespace? Is this even possible?

I'm pretty much a rookie at regex, so I'm not sure whether the expression I wrote is any good.

Appreciate it! Thanks!


Solution

  • "... With an expression I want to capture the values of the Phone Number and Email Address. ...

    ... Anyone an idea how I can successfully can get the phone number without line breaks and white spaces? Is this even possible? ..."

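    Capture each run of digits in its own group. A single group always captures one contiguous span of text, so it cannot skip the line break in the middle of the number; with separate groups, the whitespace and line breaks fall between the groups and are never captured.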

    To preface, you did not specify which regex engine you're using, so you may need to adjust the syntax accordingly.

    This will capture the phone number.
    The country code, including the + sign, is in group 1, and the remaining runs of digits are in groups 2 through 5.

    (?s)id=\"PHONE\">Phone Number:.+?(?:(\+\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+))<\/span>
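As a rough illustration of how the five groups can be put back together, here is a minimal sketch that assumes Python's `re` module (the question does not name an engine). The name `extract_phone` is hypothetical, and the `style` attribute in the sample is shortened for brevity.

```python
import re

# Sample markup from the question; the style attribute is shortened here.
html = """<span><b id="PHONE">Phone Number:</b> </span>
<span
  style="margin-left: 8px; font-family: 'PT Sans', sans-serif;">+32
  123 45 67 89</span>
<br />"""

# The phone pattern from this answer, split over two lines for readability.
PHONE_PATTERN = re.compile(
    r'(?s)id=\"PHONE\">Phone Number:.+?'
    r'(?:(\+\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+))<\/span>'
)


def extract_phone(text):
    """Return the phone number with single spaces, or None if nothing matches."""
    match = PHONE_PATTERN.search(text)
    if match is None:
        return None
    # Groups 1 through 5 hold only digits (plus the leading +), so joining
    # them drops the line break and indentation found in the markup.
    return " ".join(match.groups())


print(extract_phone(html))  # +32 123 45 67 89
```

Joining the groups is done outside the regex on purpose: the pattern only decides what to skip, and the caller decides which separator to use.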
    

    And this will capture the email address.
    Note that the closing span tag after the email address did not start with a < character.
    I imagine this was a typo on your part, so the pattern assumes a well-formed closing tag.
    If not, just leave a comment and I can refactor the pattern.

    (?s)id=\"EMAIL\">Email Address:.+?<\/span>.+?>(.+?)<\/span>
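The email pattern can be tried the same way. This is only a sketch, again assuming Python's `re` module: `extract_email` is a hypothetical name, `user@example.com` is a placeholder address that does not come from the question, and the sample assumes the closing tag is a proper `</span>`.

```python
import re

# Hypothetical sample: placeholder address, closing tag written as </span>,
# style attribute shortened.
html = """<span><b id="EMAIL">Email Address:</b> </span>
<span
  style="margin-left: 8px; font-family: 'PT Sans', sans-serif;">user@example.com</span>
<br />"""

# The email pattern from this answer; group 1 is the text of the second span.
EMAIL_PATTERN = re.compile(
    r'(?s)id=\"EMAIL\">Email Address:.+?<\/span>.+?>(.+?)<\/span>'
)


def extract_email(text):
    """Return the captured address without surrounding whitespace, or None."""
    match = EMAIL_PATTERN.search(text)
    if match is None:
        return None
    # strip() guards against a line break or indentation inside the span.
    return match.group(1).strip()


print(extract_email(html))  # user@example.com
```

Unlike the phone number, the address has no whitespace in the middle, so one group followed by a trim is enough here.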
    

    From here, you can merge the two patterns into a single pattern.

    (?s)id=\"PHONE\">Phone Number:.+?(?:(\+\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+))<\/span>.+?id=\"EMAIL\">Email Address:.+?<\/span>.+?>(.+?)<\/span>
    

    Output

    +32 123 45 67 89
    [email protected]
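For completeness, a sketch of the merged pattern in use, under the same assumptions as before: Python's `re` module, a hypothetical function name (`extract_contact`), a placeholder address, shortened `style` attributes, and a well-formed closing `</span>` after the address.

```python
import re

# Hypothetical sample combining both blocks from the question.
html = """<span><b id="PHONE">Phone Number:</b> </span>
<span
  style="margin-left: 8px;">+32
  123 45 67 89</span>
<br />

<span><b id="EMAIL">Email Address:</b> </span>
<span
  style="margin-left: 8px;">user@example.com</span>
<br />"""

# The merged pattern from this answer: groups 1-5 are the phone parts,
# group 6 is the email address.
MERGED_PATTERN = re.compile(
    r'(?s)id=\"PHONE\">Phone Number:.+?'
    r'(?:(\+\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+)[^\d]+(\d+))<\/span>'
    r'.+?id=\"EMAIL\">Email Address:.+?<\/span>.+?>(.+?)<\/span>'
)


def extract_contact(text):
    """Return (phone, email), or None if the merged pattern does not match."""
    match = MERGED_PATTERN.search(text)
    if match is None:
        return None
    *phone_parts, email = match.groups()
    return " ".join(phone_parts), email.strip()


print(extract_contact(html))  # ('+32 123 45 67 89', 'user@example.com')
```

One trade-off to keep in mind: the merged pattern only matches when the phone block comes before the email block, whereas the two separate patterns work in either order.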