php html htmlspecialchars

How to "use" HTML special chars in PHP function like strpos?


My problem is that I can't use PHP string functions like substr, strpos, etc. with HTML special characters such as the middle dot.

My specific problem:

$tdp = gettexts('TDP: ' , '•' , $complete_info);

Function giving me a text fragment:

function gettexts ($startst, $endst, $content){
    $first_step = explode($startst , $content);
    $second_step = explode($endst , $first_step[1]);
    $textst= $second_step[0];
    return $textst;
}

Doesn't work. How can I fix this?

EDIT: It works when I test it with this code:

$turbo = gettexts('Turbo: ' , '•' , 'Turbo: 4.70GHz • TDP: 220W • Fertigung: 32nm •');

This is the page I want to read out: http://skinflint.co.uk/intel-core-i7-6700t-cm8066201920202-a1261888.html

Here is complete code for testing. The result for the turbo frequency should be 3.60. (I can't use "GHz" as the end delimiter, because sometimes the value is "Turbo: N/A", and I really want to use the dots for exploding ;)

<?php
$content = file_get_contents('http://geizhals.eu/intel-core-i7-6700t-cm8066201920202-a1261888.html');
$complete_info= strip_tags(gettexts('<div id="gh_proddesc">' ,'Gelistet seit:' , $content));
var_dump($complete_info);
echo '<br><br>';
function gettexts ($startst, $endst, $content){
    $first_step = explode($startst , $content);
    $second_step = explode($endst , $first_step[1]);
    $textst= $second_step[0];
    return $textst;
}
echo 'Frequency:'. $frequency = gettexts('Taktfrequenz: ' , 'GHz' , $complete_info);
echo '<br>';
echo 'Turbo-Frequency:'.$turbo = gettexts('Turbo: ' , '•' , $complete_info);
?>

I couldn't find a code-sharing site that allows URL reading, but http://phpfiddle.org/ allows it (without sharing).


Solution

  • Edited:

    So you're scraping a page and want to extract some info. My previous code works if you copy-paste the text, but when you fetch the webpage there are encoding issues (that page is cp1252-encoded but sends no charset header).
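To illustrate the encoding issue mentioned above, here is a minimal sketch. It assumes the bullet arrives as the raw cp1252 byte 0x95, which may not be how the real page delivers it (it could just as well contain a literal entity). The sample string and variable names are made up for the demonstration, and iconv is assumed to be available.

```php
<?php
// Hypothetical sample: the bullet is the single cp1252 byte 0x95.
$raw = "Turbo: 3.60GHz \x95 TDP: 35W \x95";

// A bullet typed into a UTF-8 encoded PHP file is three bytes: E2 80 A2.
$needle = "\xE2\x80\xA2";

// The byte sequences differ, so the search fails on the raw download.
$foundInRaw = strpos($raw, $needle);   // false

// Convert the downloaded bytes to UTF-8 first, then search again.
$utf8 = iconv('CP1252', 'UTF-8', $raw);
$foundInUtf8 = strpos($utf8, $needle); // an integer offset

var_dump($foundInRaw, $foundInUtf8);
```

If the page uses a different encoding, or delivers the bullet as an entity, the conversion step would have to change accordingly.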

    You should really be parsing the DOM (after fixing the encoding header) and using XPath to extract the content. But for the sake of a quick fix based on your code, just remove the strip_tags call and use my function.
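The DOM and XPath route suggested above could look roughly like the following sketch. The inline markup is a hypothetical stand-in for the downloaded page (only the gh_proddesc id comes from the question), and the "label: value" layout separated by bullets is an assumption about how the description is written.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$html = '<html><body><div id="gh_proddesc">'
      . 'Taktfrequenz: 2.80GHz &bull; Turbo: 3.60GHz &bull; TDP: 35W'
      . '</div></body></html>';

// Parse leniently; real-world HTML usually triggers libxml warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the description block by its id.
$xpath = new DOMXPath($doc);
$node  = $xpath->query('//div[@id="gh_proddesc"]')->item(0);
$text  = $node !== null ? $node->textContent : '';

// textContent is UTF-8 with entities already decoded,
// so split on the UTF-8 bullet (bytes E2 80 A2).
$specs = [];
foreach (explode("\xE2\x80\xA2", $text) as $part) {
    $pair = explode(':', $part, 2);
    if (count($pair) === 2) {
        $specs[trim($pair[0])] = trim($pair[1]);
    }
}

print_r($specs);
```

Because the parser decodes entities and normalizes the text to UTF-8, the delimiter no longer depends on how the page happened to encode the bullet.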

    Look at the source before and after downloading the page and you will notice that if you use strip_tags, the HTML entities are gone.

    This will work:

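    // Decode HTML entities in the delimiters and in the content so both sides compare alike.
    function gettexts ($startst, $endst, $content){
        $first_step = explode(html_entity_decode($startst) , html_entity_decode($content));
        if (!isset($first_step[1])) {
            return ''; // start delimiter not found
        }
        $second_step = explode(html_entity_decode($endst), $first_step[1]);
        return $second_step[0];
    }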
    
    $content = file_get_contents('http://geizhals.eu/intel-core-i7-6700t-cm8066201920202-a1261888.html');
    $string = gettexts('<div id="gh_proddesc">' ,'Gelistet seit:' , $content);
    
    echo 'Frequency:'. $frequency = gettexts('Taktfrequenz: ' , 'GHz' , $string);
    echo '<br>';
    echo 'Turbo-Frequency:'.$turbo = gettexts('Turbo: ' , '&#149;' , $string);
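Since the question asks about strpos and substr specifically, here is a sketch of the same extraction written with those functions. The name extract_between is hypothetical, the sample string is invented, and the nullable return type assumes PHP 7.1 or later. It returns null when either delimiter is missing, so a missing label can be told apart from an empty value.

```php
<?php
// Hypothetical helper: returns the text between $start and $end, or null if either is missing.
function extract_between(string $start, string $end, string $content): ?string
{
    $from = strpos($content, $start);
    if ($from === false) {
        return null; // start delimiter not present
    }
    $from += strlen($start);

    $to = strpos($content, $end, $from);
    if ($to === false) {
        return null; // end delimiter not present after the start
    }

    return trim(substr($content, $from, $to - $from));
}

// Invented sample using the literal entity as the delimiter, as in the code above.
$sample  = 'Taktfrequenz: 2.80GHz &#149; Turbo: 3.60GHz &#149; TDP: 35W &#149;';
$turbo   = extract_between('Turbo: ', '&#149;', $sample);
$missing = extract_between('Cache: ', '&#149;', $sample);

var_dump($turbo, $missing);
```

The delimiter is compared byte for byte, so it has to match whatever form the bullet takes in the string being searched.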