javascript php html ajax web-scraping

Remove all style attributes from HTML in PHP


I need to load the body of an HTML page with every style attribute removed, and with links, images, and anything else that is not plain text stripped out. I would like to do this in PHP and have tried every solution I could find, but none has worked. I load the HTML page with an Ajax call to my script, then use a regular expression to extract the body, which I then want to clean up. Can you help me? This is the Ajax call:

$.ajax({
    type: "GET",
    url: "core/proxy.php?url=" + cerca,
    success: function(data) {
        // Keep only the markup between <body ...> and </body>.
        var body = data.replace(/^[\S\s]*<body[^>]*?>/i, "")
                       .replace(/<\/body[\S\s]*$/i, "");
        $("div#risultato").html(body);
    },
    error: function() {
        alert("failed");
    }
});
});
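
The body-extraction step in the success callback can be pulled out into a small function so it can be checked on its own, without the Ajax call. This is only a minimal sketch: the name `extractBody` and the sample `page` string are hypothetical, and the two regular expressions are the ones already used in the question.

```javascript
// Hypothetical helper (not part of the original code): returns the markup
// between <body ...> and </body>, using the same two regular expressions
// as the success callback. Input without a <body> tag is returned unchanged.
function extractBody(html) {
    return html
        .replace(/^[\S\s]*<body[^>]*?>/i, "")  // drop everything up to and including <body ...>
        .replace(/<\/body[\S\s]*$/i, "");      // drop </body> and everything after it
}

// Made-up sample page, for illustration only.
var page = '<html><head><title>t</title></head>' +
           '<body class="x"><p style="color:red">Hi</p></body></html>';
var inner = extractBody(page);
// inner now holds the <p> element only; its style attribute is still present.
```

Note that this step only isolates the body. The style attributes, images, and links are still in the returned string, so the cleanup has to happen afterwards.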

Solution

  • @Jose Antonio Riaza Valverde, I made the correction, but nothing changes:

    $.ajax({
        // type of the request
        type: "GET",
        // URL of the resource to contact
        url: "core/proxy.php?url=" + cerca,
        // action on success
        success: function(data)
        {
            var body = data.replace(/^[\S\s]*<body[^>]*?>/i, "")
                           .replace(/<\/body[\S\s]*$/i, "");
            $("div#risultato").html(body);
            clearStyles(document.getElementById('risultato'));
        },
        // action on error
        error: function()
        {
            alert("Call failed");
        }
    });
    });
    

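    And the function:

    function clearStyles(element) {
        // Blank the inline style attribute on this element.
        element.setAttribute('style', ' ');
        // These two calls only add attributes named "img" and "a" to the
        // element; they do not remove <img> or <a> elements from the page.
        element.setAttribute('img', ' ');
        element.setAttribute('a', ' ');
        // Repeat for every child element.
        for (var i = 0; i < element.children.length; i++) {
            clearStyles(element.children[i]);
        }
    }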
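
One possible reason the images and links survive is that `setAttribute('img', ' ')` and `setAttribute('a', ' ')` set attributes with those names rather than touching `<img>` or `<a>` elements. Below is a rough client-side sketch of a cleanup that removes the elements themselves. It assumes "no links" means keeping the link text while dropping the `<a>` wrapper, and the function name `stripPresentation` is hypothetical. It uses only standard DOM calls (`removeAttribute`, `removeChild`, `insertBefore`, `children`, `firstChild`).

```javascript
// Hypothetical helper: cleans an element's subtree in place.
//  - removes the inline style attribute from every element
//  - removes <img> elements
//  - unwraps <a> elements, keeping their contents (an assumption about
//    what "no links" should mean)
function stripPresentation(element) {
    element.removeAttribute('style');

    // Copy the children first: element.children is a live collection in a
    // browser, and the loop below removes nodes from it.
    var children = Array.prototype.slice.call(element.children);

    for (var i = 0; i < children.length; i++) {
        var child = children[i];
        var tag = child.tagName.toLowerCase();

        if (tag === 'img') {
            element.removeChild(child);
            continue;
        }

        // Clean the subtree before possibly unwrapping it.
        stripPresentation(child);

        if (tag === 'a') {
            // Move the link's contents in front of it, then drop the link.
            while (child.firstChild) {
                element.insertBefore(child.firstChild, child);
            }
            element.removeChild(child);
        }
    }
}
```

In the success callback this would be called after the markup has been inserted, for example `stripPresentation(document.getElementById('risultato'));`. The same cleanup could instead be done server-side in `core/proxy.php` before the response is sent, which is closer to what the question title asks for.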