I want to find the titles of pages in a huge haystack of HTML. The links do not have any class or unique id, so I can't use a DOM parser here; I am aware I must use regular expressions. Here is an example of what I am trying to find:
<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>
The output should be an array with:
[0] => Series Hell In Heaven information
[1] => Series What is going information
All series titles start with Series and end with information. From a huge string containing many other things, I only want to extract the titles. Currently I am trying to use a regex, but it's not working. Here's what I am doing right now:
$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
print_r($matches);
echo "</pre>";
I don't know much about writing regular expressions. Help would be appreciated. Thanks.
Try
preg_match_all('/(Series.+?information)/', $str, $matches );
As shown here:
https://regex101.com/r/oJ0jZ4/1
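To make that concrete, here is a minimal sketch that runs the pattern against the sample markup from the question. The variable name $html and the heredoc are just for illustration.

```php
<?php
// Sample markup copied from the question; $html is an illustrative name.
$html = <<<HTML
<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/123">
Series What is going information
</a>
HTML;

// No anchors and no literal dot; the lazy .+? stops at the first "information".
preg_match_all('/(Series.+?information)/', $html, $matches);

// $matches[0] holds the full matches, $matches[1] the first capture group.
print_r($matches[1]);
```

Since the dot does not match a newline by default, each match stays on its own line of the input.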
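As I said in the comments, remove the literal dot \. and the start and end anchors (without the m flag, ^ and $ only match at the very start and end of the whole string). I would also use a non-greedy quantifier that requires at least one character, .+?
Otherwise you could match this: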
Seriesinformation
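A quick way to see the difference, assuming the alternative would have been a zero-or-more quantifier such as .*? (my assumption, it is not in the original pattern):

```php
<?php
$str = 'Seriesinformation';

// .*? allows zero characters in between, so the two bare words match.
$zeroOrMore = preg_match('/Series.*?information/', $str);

// .+? requires at least one character in between, so they do not.
$oneOrMore = preg_match('/Series.+?information/', $str);

var_dump($zeroOrMore, $oneOrMore);
```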
If the casing of Series or information may vary, as in
Series .... Information
add the /i flag:
preg_match_all('/(Series.+?information)/i', $str, $matches );
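A small sketch of what the flag changes. The test string here is made up for the comparison:

```php
<?php
// Hypothetical input with mixed casing on both keywords.
$str = "Series Hell In Heaven Information\nSERIES What is going information";

// Without /i neither line matches: "Information" and "SERIES" differ in case.
preg_match_all('/(Series.+?information)/', $str, $caseSensitive);

// With /i both lines match.
preg_match_all('/(Series.+?information)/i', $str, $caseInsensitive);

print_r($caseSensitive[1]);
print_r($caseInsensitive[1]);
```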
The outer capture group isn't really needed, but I think it looks nicer with it in there. If you just want the variable content without the Series or information, move the capture group ( ) to that part:
preg_match_all('/Series(.+?)information/i', $str, $matches );
Note that you'll want to trim() the match, because it will likely have spaces at the beginning and end. Alternatively, add the spaces to the regex like this:
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );
But that will no longer match Series information with only one space between the two words.
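Both options can be sketched like this. The input strings are illustrative, and array_map with trim is just one way to clean up every capture at once:

```php
<?php
$str = "Series Hell In Heaven information\nSeries What is going information";

// Capture only the part between the two keywords, then trim the spaces.
preg_match_all('/Series(.+?)information/i', $str, $matches);
$titles = array_map('trim', $matches[1]);
print_r($titles);

// With explicit \s on both sides, a single space between the words is not enough:
// one \s consumes the space, (.+?) needs a character, and a second \s is required.
$found = preg_match('/Series\s(.+?)\sinformation/i', 'Series information');
var_dump($found);
```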
If you want to be sure you don't match past an information, such as in
[Series Hell In Heaven information Series Hell In Heaven information]
where the whole string would be matched, you can use a positive lookbehind:
preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );
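As a rough check, this sketch compares a greedy pattern (my addition, for contrast) with the lookbehind version on that string. The plain lazy pattern from earlier is included as well so the results can be compared side by side.

```php
<?php
$str = '[Series Hell In Heaven information Series Hell In Heaven information]';

// Greedy .+ runs to the last "information", giving one long match.
preg_match_all('/(Series.+information)/i', $str, $greedy);

// Lazy .+? with a lookbehind stops as soon as "information" has just been read.
preg_match_all('/(Series.+?(?<=information))/i', $str, $lookbehind);

// The plain lazy pattern, for comparison.
preg_match_all('/(Series.+?information)/i', $str, $lazy);

print_r($greedy[1]);
print_r($lookbehind[1]);
print_r($lazy[1]);
```

On this particular input the lookbehind and the plain lazy pattern return the same two matches; only the greedy one swallows both titles.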
Conversely, if there is a possibility that a title contains the word information twice:
<a href="http://example.com/123">
Series information is power information
</a>
you can do this:
preg_match_all('/(Series[^<]+)</i', $str, $matches );
which will match everything up to the < of the closing </a>.
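Here is a short sketch of that case using the markup above. Note that the capture also picks up the newline before the closing tag, so I trim it; the variable names are illustrative.

```php
<?php
$html = "<a href=\"http://example.com/123\">\nSeries information is power information\n</a>";

// The lazy pattern stops at the first "information", which is too early here.
preg_match_all('/(Series.+?information)/i', $html, $lazy);

// [^<]+ keeps going until the next tag starts, regardless of the words inside.
preg_match_all('/(Series[^<]+)</i', $html, $untilTag);
$titles = array_map('trim', $untilTag[1]);

print_r($lazy[1]);
print_r($titles);
```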
As a side note, you could use the phpQuery library (which is a DOM parser) and look for an a tag that contains those words.
https://github.com/punkave/phpQuery
And
https://code.google.com/archive/p/phpquery/wikis/Manual.wiki
Using something like
$tags = $doc->find("a:contains('Series')")->text();
This is an excellent library for parsing HTML.
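If you would rather not pull in a library, the same idea can be sketched with PHP's built-in DOMDocument instead of phpQuery. This is a swapped-in alternative, not the phpQuery API, and it assumes the dom extension is enabled; the second link in the sample is made up to show the filtering.

```php
<?php
$html = '<a href="http://example.com/xyz">
Series Hell In Heaven information
</a>
<a href="http://example.com/about">About</a>';

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$titles = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    // textContent includes the surrounding newlines, so trim it.
    $text = trim($link->textContent);
    if (strpos($text, 'Series') !== false) {
        $titles[] = $text;
    }
}

print_r($titles);
```

Links do not need a class or id for this to work, since every a element is visited and filtered by its text.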