php, regex, simple-html-dom

PHP Regex to find a substring from a big string - Matching start and end


I want to extract page titles from a huge haystack. The titles do not have any class or unique id, so I can't use a DOM parser here; I am aware I must use regular expressions. Here is an example of what I am trying to find:

<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>

The output should be an array with:

[0] => Series Hell In Heaven information
[1] => Series What is going information

All series titles start with Series and end with information. From a huge string containing many other things, I only want to extract the titles. I am currently trying a regex, but it's not working. Here's what I am doing right now:

$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
    print_r($matches);
echo "</pre>";

I don't know much about writing regular expressions. Help would be appreciated. Thanks.


Solution

  • Try

     preg_match_all('/(Series.+?information)/', $str, $matches );
    

    As demonstrated here:

    https://regex101.com/r/oJ0jZ4/1

    As I said in the comments, remove the literal dot (\.) and the start (^) and end ($) anchors. I would also use a non-greedy quantifier that requires at least one character: .+?

    Otherwise (with .*? for example) you could match this

    Seriesinformation
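
To make the difference concrete, here is a minimal sketch that runs the question's anchored pattern and the unanchored, non-greedy one against the same input. The `$html` sample is an assumption modelled on the markup in the question, and the variable names are only illustrative.

```php
<?php
// Sample haystack modelled on the question's markup (an assumption for illustration).
$html = <<<HTML
<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>
HTML;

// Question's pattern: ^ and $ anchor to the whole string, and \. demands a literal dot.
$anchoredCount = preg_match_all('/^Series\..*information$/', $html, $anchored);

// Unanchored, non-greedy pattern from this answer.
$lazyCount = preg_match_all('/Series.+?information/', $html, $lazy);

print_r($anchored[0]); // nothing was matched
print_r($lazy[0]);     // the two titles

// .+? needs at least one character between the two words, so this finds nothing.
$gluedCount = preg_match('/Series.+?information/', 'Seriesinformation');
var_dump($gluedCount);
```

preg_match_all() puts the full matches in index 0 of the third argument, so `$lazy[0]` is the array the question asks for.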
    

    If the casing of Series or information may vary, such as

    Series .... Information

    add the i flag, as in

         preg_match_all('/(Series.+?information)/i', $str, $matches );
    

    The outer capture group isn't really needed, but I think it looks nicer with it in there. If you just want the variable content without Series or information, move the capture group ( ) to that part:

     preg_match_all('/Series(.+?)information/i', $str, $matches );
    

    Note that you'll want to trim() the match, because it will likely have spaces at the beginning and end. Alternatively, add the whitespace to the regex like this:

     preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );
    

    But that will no longer match Series information when the two words are separated by only a single space.
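
A small sketch of both options, assuming a hypothetical `$str` laid out like the question's markup. It captures only the middle part, trims it with array_map(), and then shows the single-space case that the `\s` variant rejects.

```php
<?php
// Hypothetical input; the title sits on its own indented line, as in the question.
$str = "<a href=\"http://example.com/xyz\">\n    Series Hell In Heaven information\n</a>";

// Capture only the text between the two marker words.
preg_match_all('/Series(.+?)information/i', $str, $matches);

// $matches[1] holds the first capture group; trim() removes the padding spaces.
$names = array_map('trim', $matches[1]);
print_r($names);

// The \s variant needs one whitespace character on each side of the capture,
// so "Series information" with a single space is not matched.
$oneSpace = preg_match('/Series\s(.+?)\sinformation/i', 'Series information');
var_dump($oneSpace);
```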

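    If you want to be sure a match does not run past the first information when several titles sit next to each other, such as

    [Series Hell In Heaven information Series Hell In Heaven information]
    

    being matched as one result, note that the non-greedy .+? already stops at the first information. A positive lookbehind is another way to write the same thing: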

    preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );
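
As a rough check of how the quantifier affects this case, the sketch below runs a greedy pattern, the non-greedy pattern, and the lookbehind form against the bracketed line above. The greedy pattern is not part of this answer; it is included only as a contrast.

```php
<?php
$line = '[Series Hell In Heaven information Series Hell In Heaven information]';

// Greedy .+ (shown only for contrast) runs to the last "information": one long match.
preg_match_all('/Series.+information/i', $line, $greedy);

// Non-greedy .+? stops at the first "information" after each "Series".
preg_match_all('/Series.+?information/i', $line, $lazy);

// The lookbehind form gives the same two matches on this input.
preg_match_all('/Series.+?(?<=information)/i', $line, $lookbehind);

print_r($greedy[0]);
print_r($lazy[0]);
print_r($lookbehind[0]);
```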
    

    Conversely, a title might itself contain the word information twice:

       <a href="http://example.com/123">
            Series information is power information
       </a>
    

    In that case you can do this:

        preg_match_all('/(Series[^<]+)</i', $str, $matches );
    

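    This matches everything up to the < of the closing </a> tag. The captured text includes any trailing whitespace before that tag, so trim() it.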
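
Here is a short sketch of that case, assuming a hypothetical `$str` with the same layout as the example above. It shows where the non-greedy pattern cuts the title short and how the `[^<]+` pattern, plus trim(), recovers the whole line.

```php
<?php
// Hypothetical input with the word "information" appearing twice in one title.
$str = "<a href=\"http://example.com/123\">\n    Series information is power information\n</a>";

// The non-greedy pattern stops at the first "information" and truncates the title.
preg_match_all('/Series.+?information/i', $str, $short);

// [^<]+ consumes everything up to, but not including, the next "<".
// That includes the newline before </a>, hence the trim().
preg_match_all('/(Series[^<]+)</i', $str, $full);
$titles = array_map('trim', $full[1]);

print_r($short[0]);
print_r($titles);
```

The trade-off is that this pattern no longer checks for the trailing information at all; it relies on the title being the last text before the next tag.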

    As a side note, you could use the phpQuery library (which is a DOM parser) and look for an a tag that contains those words.

    https://github.com/punkave/phpQuery

    And

    https://code.google.com/archive/p/phpquery/wikis/Manual.wiki

    Using something like this (untested; check the manual above for the exact selector syntax)

      $tags = $doc->find("a:contains('Series')")->text();
    

    This is an excellent library for parsing HTML.
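
If installing a third-party library is not an option, the same idea can be sketched with PHP's bundled DOM extension. This swaps phpQuery for DOMDocument and DOMXPath, and it assumes the dom extension is enabled. The `$html` sample and the XPath expression are illustrative assumptions, not something taken from the question.

```php
<?php
// Illustrative input: one matching link and one unrelated link.
$html = <<<HTML
<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/other">
    Something unrelated
</a>
HTML;

// Collect parser warnings internally instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> whose text contains "Series", then trim the padding.
$xpath = new DOMXPath($doc);
$titles = [];
foreach ($xpath->query('//a[contains(., "Series")]') as $a) {
    $titles[] = trim($a->textContent);
}

print_r($titles);
```

Unlike the `i` flag in the regex versions, XPath's contains() is case-sensitive.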