PHP str_replace scraped content with wild card?

I'm looking for a way to strip some HTML from a scraped page. The page contains repetitive markup that I would like to delete, so I tried preg_replace() to remove the variable part.

Data I want to strip:

Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2

It should look like this afterwards:

Producent:Example
Groep:Example1
Type:Example2

So most of the markup is identical; only the value of the data-title attribute varies. How can I delete this piece of markup?

I tried a few things like this one:

$pattern = '/<td class=\"datatable__body__item\"(.*?)>/';
$tech_specs = str_replace($pattern,"", $tech_specs);

But that didn't work. Is there any solution to this?
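
The attempt above fails because str_replace() searches for a literal string and knows nothing about regex patterns or wildcards, so the text `/<td class=.../` is simply never found in the input. A minimal sketch with preg_replace() instead, assuming the three sample lines from the question as input (real scraped markup may contain extra whitespace or attributes that this does not cover):

```php
<?php
// Sample input copied from the question; real scraped markup may differ.
$tech_specs = <<<'HTML'
Producent:<td class="datatable__body__item" data-title="Producent">Example
Groep:<td class="datatable__body__item" data-title="Produkt groep">Example1
Type:<td class="datatable__body__item" data-title="Produkt type">Example2
HTML;

// [^>]* matches everything up to the closing '>' of the tag,
// so the data-title value can vary freely.
$pattern = '/<td class="datatable__body__item"[^>]*>/';

// preg_replace() interprets $pattern as a regular expression;
// str_replace() would look for it as a literal string.
$tech_specs = preg_replace($pattern, '', $tech_specs);

echo $tech_specs, PHP_EOL;
// Producent:Example
// Groep:Example1
// Type:Example2
```

Using `[^>]*` rather than `(.*?)` keeps the match inside a single tag and avoids an unused capture group.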


  • Well, maybe my question wasn't worded very well. I had a table that I needed to scrape from a website. I needed the info in the table but had to clean up some parts, as mentioned. The solution I finally came up with is below, and it works. It still needs a few manual replacements, but that is because of the stupid " they use for inch. ;-)


       // find the tables in the source code ($techdata is the simplehtmldom document)
       foreach ($techdata->find('table') as $table) {
           // loop over the rows
           foreach ($table->find('tr') as $row) {
               // take the inner HTML of the row using simplehtmldom
               $tech_specs = $row->innertext;
               // strip the opening tag of the first cell, including its leading whitespace
               $tech_specs = str_replace("  \t\t\t\t\t\t\t\t\t\t\t<td class=\"datatable__body__item\">", "", $tech_specs);
               // take the text before the first </td> (the label) so it can be used below
               $spec1 = explode('</td>', $tech_specs)[0];
               // the second cell repeats the label in data-title; replace its opening tag with ':'
               $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"" . $spec1 . "\">", ":", $tech_specs);
               // manual correction: the " used for inch cuts the data-title value short
               $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"tbv Montage benodigde 19\">", ":", $tech_specs);
               // manual correction: same problem with the " used for inch
               $tech_specs = str_replace("<td class=\"datatable__body__item\" data-title=\"19\">", ":", $tech_specs);
               // strip the remaining tabs, closing tags and double spaces
               $tech_specs = str_replace("\t\t\t\t\t\t\t\t\t\t", "\n", $tech_specs);
               $tech_specs = str_replace("</td>", "", $tech_specs);
               $tech_specs = str_replace("  ", "", $tech_specs);
               // put the clean row in an array, ready for use
               $specs[] = $tech_specs;
           }
       }
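
The manual corrections are needed because the label in the first cell no longer matches the data-title value once an inch sign (") cuts the attribute short. One way around that is to match any data-title value with a regex instead of rebuilding the exact string. This is only a sketch: the helper name clean_row() and both sample rows are hypothetical, and it produces a single "label:value" line per row rather than reproducing the exact whitespace handling of the loop above.

```php
<?php
// Hypothetical helper: turns the inner HTML of one <tr> into "label:value".
function clean_row(string $rowHtml): string
{
    // Replace the value cell's opening tag (whatever its data-title is) with ':'.
    $text = preg_replace('/<td class="datatable__body__item" data-title="[^>]*>/', ':', $rowHtml);
    // Drop the remaining opening and closing <td> tags.
    $text = preg_replace('/<\/?td[^>]*>/', '', $text);
    // Remove whitespace around the first colon, then trim the ends.
    $text = preg_replace('/\s*:\s*/', ':', $text, 1);
    return trim($text);
}

// Made-up rows shaped like the markup described in the question.
$row = "\t<td class=\"datatable__body__item\">Producent</td>\n"
     . "\t<td class=\"datatable__body__item\" data-title=\"Producent\">Example</td>\n";
$inchRow = "<td class=\"datatable__body__item\">19\" rack</td>"
         . "<td class=\"datatable__body__item\" data-title=\"19\" rack\">Ja</td>";

echo clean_row($row), PHP_EOL;     // Producent:Example
echo clean_row($inchRow), PHP_EOL; // 19" rack:Ja
```

Inside the foreach over rows this would be called as `$specs[] = clean_row($row->innertext);`. It assumes no attribute value contains a `>` character and that the label itself contains no colon.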