Tags: php, html, regex, preg-match, text-extraction

Get number which occurs after its label text in HTML


I'm using PHP to parse an e-mail and want to get the number after a specific string.

For example, I would want to get the number 033 from a string that looks like:

 Account Number: 033 
 Account Information: Some text here

The content is actually HTML, so the input string is more accurately presented as:

<font face="Arial, Helvetica, sans-serif" color="#000099"><strong><font color="#660000">Account  Number</font></strong><font color="#660000">: 033<br><strong>Account Name</strong>: More text here<br>
    

The text Account Number: always comes first, followed by the number and then a line break. I have:

 preg_match_all('!\d+!', $str, $matches);

But that just gets all the numbers.


Solution

  • If the number is always after Account Number: (including that space at the end), then just add that to your regex:

    preg_match_all('/Account Number: (\d+)/', $str, $matches);
    // The parentheses capture the digits and store them in $matches[1]
    

    Results:

    $matches Array:
    (
        [0] => Array
            (
                [0] => Account Number: 033
            )
    
        [1] => Array
            (
                [0] => 033
            )
    
    )
    

    Note: If HTML is present, it can be included in the regex as well, as long as you don't expect the HTML to change. Otherwise, I suggest using an HTML DOM parser to get the plain-text version of your string and applying a regex to that.

    With that said, the following is an example that includes the HTML in the regex and captures the same number in $matches[1]:

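    // @ is used as the delimiter so the / in the closing tags needs no escaping;
    // \s+ matches the run of whitespace between "Account" and "Number" in the HTML
    preg_match_all('@<font face="Arial, Helvetica, sans-serif" color="#000099"><strong><font color="#660000">Account\s+Number</font></strong><font color="#660000">: (\d+)@', $str, $matches);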
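
The plain-text route mentioned in the note could be sketched roughly as follows. This is only an illustration: it uses PHP's built-in strip_tags() as a lightweight stand-in for a full DOM parser, and the function name extractAccountNumber is hypothetical, not something from the question. It assumes the label text stays "Account Number" followed by a colon, with any amount of whitespace in between.

```php
<?php
// Hypothetical helper (name chosen for illustration only).
// Returns the digits after "Account Number:" as a string, or null if not found.
function extractAccountNumber(string $html): ?string
{
    // strip_tags() removes the markup, leaving only the text content.
    $text = strip_tags($html);

    // \s+ tolerates one or more whitespace characters between the two words;
    // \s* allows optional whitespace around the colon.
    if (preg_match('/Account\s+Number\s*:\s*(\d+)/', $text, $m)) {
        return $m[1]; // kept as a string so leading zeros survive
    }

    return null;
}

// Sample input copied from the question.
$str = '<font face="Arial, Helvetica, sans-serif" color="#000099"><strong><font color="#660000">Account  Number</font></strong><font color="#660000">: 033<br><strong>Account Name</strong>: More text here<br>';

var_dump(extractAccountNumber($str));
```

Since only one number is wanted, preg_match() is used here instead of preg_match_all(), and the result is returned as a string so that the leading zero in 033 is not lost to an integer cast.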