I know it's better to use DOM for this purpose but let's try to extract the text in this way:
<p>Some text</p>
preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);
if (empty($matches))
    exit('Cannot find the <body> start tag.');
$matched_body_start_tag = $matches[0][0];
$index_of_body_start_tag = $matches[0][1];
$index_of_body_end_tag = strpos($html, '</body>');
$body = substr(
    $html,
    $index_of_body_start_tag + strlen($matched_body_start_tag),
    $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
);
echo $body;
The result can be seen here: http://ideone.com/vH2FZ
As you can see, I am getting more text than expected.
There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
I don't see anything wrong with this formula.
Could somebody kindly suggest where the problem is?
Many thanks to you all.
Thank you very much to all of you. There was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));
or:
$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
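For anyone landing here later, this is a minimal sketch of the corrected extraction. The sample $html and the helper name extract_body are assumptions made for illustration (the original document is not shown in full); the only change that matters is that the length is the end offset minus the offset where the content starts.

```php
<?php
// Assumed sample document, for illustration only.
$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";

// Hypothetical helper: returns the text between <body ...> and </body>,
// or null when either tag is missing.
function extract_body(string $html): ?string
{
    if (!preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE)) {
        return null; // no <body> start tag
    }
    $start_tag     = $matches[0][0];
    // Offset of the first character after the start tag.
    $content_start = $matches[0][1] + strlen($start_tag);

    $end = strpos($html, '</body>', $content_start);
    if ($end === false) {
        return null; // no </body> end tag
    }

    // Length = end offset - (start offset + start tag length).
    return substr($html, $content_start, $end - $content_start);
}

echo extract_body($html);
```

Computing $content_start once and reusing it for both the start and the length makes it hard to get the sign of the tag length wrong again.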
The problem is that your string contains newlines, and by default . in a pattern does not match newline characters. You need to add the s modifier to make . match across multiple lines.
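To illustrate the effect of the s modifier, here is a small sketch. The sample string and the variable names are made up for the demonstration and are not taken from the question.

```php
<?php
// Assumed sample text with a newline between the tags.
$text = "<b>line one\nline two</b>";

// Without the s modifier, . does not match "\n", so (.*) cannot reach </b>.
$without = preg_match('/<b>(.*)<\/b>/', $text, $m1);

// With the s modifier, . also matches newlines, so the whole span is captured.
$with = preg_match('/<b>(.*)<\/b>/s', $text, $m2);

var_dump($without); // int(0)
var_dump($with);    // int(1)
```

This only matters when . has to cross a line break, for example when capturing everything between two tags.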
Here is my solution, I prefer it this way.
<body buu="grger" ga="Gag">
<p>Some text</p>
// get anything between <body> and </body> where <body can="have_as many" attributes="as required">
if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
    $body = $matches[1];
}
// outputting all matches for debugging purposes
print_r($matches);
Edit: I am updating my answer to give you a better explanation of why your code fails.
You have this string:
<p>Some text</p>
Everything seems fine with it, but it actually contains non-printable characters (line breaks) at the end of each line. You have 53 printable characters and 7 non-printable ones (the line breaks; each one here appears to take 2 characters, i.e. \r\n rather than a bare \n).
When you reach this part of the code:
$index_of_body_end_tag = strpos($html, '</body>');
You get the correct position of </body> (starting at position 51) but this counts the new lines.
So when you reach this line of code:
$index_of_body_start_tag + strlen($matched_body_start_tag)
It is evaluated to 31 (line breaks included), and:
$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
It is evaluated to 51 - 25 + 6 = 32 (the number of characters to read), but you only have 16 printable characters of text between <body> and </body>, plus 4 non-printable characters (the line break after <body> and the line break before </body>). Here is the problem: you have to group the calculation (set the precedence explicitly) like so:
$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).
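The arithmetic above can be checked with a short sketch. The document below is an assumption: it is a guess at the original input, written with \r\n line endings so that the offsets come out as the 25 and 51 quoted above.

```php
<?php
// Assumed reconstruction of the input, using \r\n line endings.
$html = "<html>\r\n<head>\r\n</head>\r\n<body>\r\n<p>Some text</p>\r\n</body>\r\n</html>";

$start   = strpos($html, '<body>');   // 25
$tag_len = strlen('<body>');          // 6
$end     = strpos($html, '</body>');  // 51

// Without parentheses the tag length is added instead of subtracted.
$wrong_length = $end - $start + $tag_len;    // 51 - 25 + 6 = 32
// With parentheses the tag length is subtracted along with the start offset.
$right_length = $end - ($start + $tag_len);  // 51 - 31 = 20

echo $wrong_length, ' ', $right_length, "\n";

// The wrong length runs past </body>; the right one stops just before it.
$too_much = substr($html, $start + $tag_len, $wrong_length);
$body     = substr($html, $start + $tag_len, $right_length);
```

With the grouped formula, $body holds exactly the 16 printable characters plus the two surrounding line breaks.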
:) Hope this helps you understand why grouping matters. (Sorry for misleading you about newlines; that point only applies to the regex example I gave above.)