I'm loading a remote file with PHP and then trying to parse it with DOMDocument. The file contains HTML, CSS (inside a style tag), and JavaScript (inside a script tag). I then load each part separately by passing 'html', 'css', or 'js' into the function that parses it. The idea is that I can use core WordPress methods to display these in the proper locations.
This is the closest I've managed to get:
libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
if ( 'html' === $part ) {
    $xpath = new DOMXPath( $html );
    $remove = $xpath->query( "//*[style or script]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'css' === $part ) {
    $xpath = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::style)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
} elseif ( 'js' === $part ) {
    $xpath = new DOMXPath( $html );
    $remove = $xpath->query( "//*[not(self::script)]" );
    foreach ( $remove as $node ) {
        $node->parentNode->removeChild($node);
    }
}
ob_start();
echo $html->saveHTML();
$output = ob_get_contents();
ob_end_clean();
This results in a few problems:
- The CSS and JS output is still wrapped in its style or script tag, and I'm trying to figure out how to remove it.
- The HTML output is wrapped in <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html><head></head><body> and I'd like to remove that as well.
I'm not sure if I need to take this in another direction, or if I just need a small change to remove these wrapping elements. I had a lot of trouble getting XPath to select the elements I want to keep rather than the ones I want to remove, and that's how I've ended up where I am.
For your html case, instead of saving the whole DOMDocument, you can save just the <body> element.
libxml_use_internal_errors( true );
$document = wp_remote_retrieve_body( $response ); // this is the remote HTML file
// create a new DomDocument object
$html = new DOMDocument( '1.0', 'UTF-8' );
// load the HTML into the DomDocument object (this would be your source HTML)
$html->loadHTML( $document, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
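if ( 'html' === $part ) {
    // get all <body> elements
    $body_elements = $html->getElementsByTagName( 'body' );
    // it is to be assumed that there is only one <body> element.
    $body = $body_elements->item( 0 );
    // save that element as HTML (the string includes the <body> tag itself).
    // item( 0 ) is null when the source has no <body>, which can happen because
    // LIBXML_HTML_NOIMPLIED stops libxml from adding one, so guard against it.
    $output = $body ? $html->saveHTML( $body ) : '';
} else {
    // ...
}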
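If the goal is the markup inside <body> without the <body> wrapper or any <style> and <script> elements, one possible variation is sketched below. It is a standalone sketch rather than a drop-in replacement: wpse_extract_body_html() is a hypothetical helper name, and it assumes loadHTML() is called without LIBXML_HTML_NOIMPLIED so that libxml supplies a <body> when the source lacks one.

```php
<?php
/**
 * Hypothetical helper: return the markup inside <body>, minus <style> and <script>.
 */
function wpse_extract_body_html( string $source ): string {
    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    // No LIBXML_HTML_NOIMPLIED here, so <html> and <body> are added if missing.
    $doc->loadHTML( $source );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    // Select the <style> and <script> elements themselves and remove them.
    $xpath = new DOMXPath( $doc );
    foreach ( iterator_to_array( $xpath->query( '//style | //script' ) ) as $node ) {
        $node->parentNode->removeChild( $node );
    }

    $body = $doc->getElementsByTagName( 'body' )->item( 0 );
    if ( null === $body ) {
        return '';
    }

    // Serialize each child of <body> so the <body> tag itself is left out.
    $output = '';
    foreach ( $body->childNodes as $child ) {
        $output .= $doc->saveHTML( $child );
    }
    return trim( $output );
}
```

Serializing the children one by one is what drops the wrapper; saveHTML( $body ) on its own would keep the <body> tag in the string.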
For the CSS and JS elements, I'm not sure why you'd need just their inner contents without the containing tag, but an approach similar to what we just did with $body would work: 1. select the elements, 2. loop over the list of elements with foreach, 3. get each element's inner contents (I believe, but am not certain, that this will be a DOMText object) and concatenate those strings to build your eventual $output variable.
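The three steps above can be sketched as follows. This is a minimal sketch, not tested against your remote file: wpse_extract_tag_contents() is a hypothetical helper name, and it reads each element's textContent property, which returns the inner text as a plain string regardless of the node type libxml uses internally.

```php
<?php
/**
 * Hypothetical helper: concatenate the inner contents of every <style> or
 * <script> element in the source, without the containing tags.
 */
function wpse_extract_tag_contents( string $source, string $tag ): string {
    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    $doc->loadHTML( $source );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    $chunks = array();
    // 1. select the elements, 2. loop over them
    foreach ( $doc->getElementsByTagName( $tag ) as $element ) {
        // 3. collect the text inside each element
        $chunks[] = trim( $element->textContent );
    }
    return implode( "\n", $chunks );
}

// Example call with inline demonstration content.
$css = wpse_extract_tag_contents( '<style>a{b:c}</style><p>x</p>', 'style' );
```

Joining with a newline keeps the contents of separate <script> elements from running together into one statement.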
An alternate approach for CSS and JS: take the cluster of <script> or <style> elements from your existing approach, insert them into the <head> of a blank DOMDocument, save that <head> as an HTML string, and then output that string via an anonymous function on WordPress' wp_enqueue_scripts hook:
/**
 * https://stackoverflow.com/questions/66361476/separate-html-css-and-javascript-from-file-with-domdocument?newreg=231eb52469c14d8c9c45ee9969df031a
 */
function wpse_66361476_alert() {
    $output = "<script>alert('hello');</script>"; // demonstration content
    add_action(
        'wp_enqueue_scripts',
        function() use ($output) {
            echo $output;
        }
    );
}
add_action('init', 'wpse_66361476_alert');
That approach is dangerous if you don't control the CSS and JS (and HTML) that you're outputting. It may be better to iframe in whatever you're loading here.
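The first half of the alternate approach, collecting the <style> and <script> elements into a blank DOMDocument, might look roughly like the sketch below. wpse_collect_tags_html() is a hypothetical helper name, and the sketch returns the children of the new <head> rather than the <head> element itself, on the assumption that the string will be echoed inside the page's existing <head>.

```php
<?php
/**
 * Hypothetical helper: copy every <style> and <script> element from the source
 * into the <head> of a blank DOMDocument and return them as one HTML string.
 */
function wpse_collect_tags_html( string $source, array $tags = array( 'style', 'script' ) ): string {
    $previous = libxml_use_internal_errors( true );
    $doc      = new DOMDocument();
    $doc->loadHTML( $source );
    libxml_clear_errors();
    libxml_use_internal_errors( $previous );

    // Blank target document with an empty <head>.
    $target = new DOMDocument();
    $head   = $target->createElement( 'head' );
    $target->appendChild( $head );

    foreach ( $tags as $tag ) {
        foreach ( $doc->getElementsByTagName( $tag ) as $element ) {
            // importNode( ..., true ) deep-copies the element into the target document.
            $head->appendChild( $target->importNode( $element, true ) );
        }
    }

    // Serialize the copied elements, leaving out the <head> wrapper.
    $output = '';
    foreach ( $head->childNodes as $child ) {
        $output .= $target->saveHTML( $child );
    }
    return $output;
}
```

The returned string still contains whatever the remote file put in those tags, so the warning above about untrusted content applies to it unchanged.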
To improve page load speed if your host is not already using a frontend cache, you may want to look into caching the parsed elements using WordPress' caching functions. Here's a short overview; talk to your hosting provider to see if there's specific advice they have.
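As one possible illustration of that caching, the sketch below uses WordPress' Transients API (get_transient() and set_transient()), which is an assumption about which caching functions fit here; the object cache would be another option. wpse_cached_part() and the $parse callback are hypothetical names, and the stand-in functions at the top exist only so the sketch runs outside WordPress.

```php
<?php
// Stand-ins so the sketch runs outside WordPress. Inside WordPress these are
// skipped and the real get_transient() / set_transient() are used.
if ( ! function_exists( 'get_transient' ) ) {
    $GLOBALS['wpse_fake_transients'] = array();
    function get_transient( $key ) {
        return $GLOBALS['wpse_fake_transients'][ $key ] ?? false;
    }
    function set_transient( $key, $value, $expiration = 0 ) {
        $GLOBALS['wpse_fake_transients'][ $key ] = $value;
        return true;
    }
}

/**
 * Hypothetical helper: return the parsed part from the cache, or parse it once
 * and store the result for $ttl seconds.
 */
function wpse_cached_part( string $url, string $part, callable $parse, int $ttl = 3600 ): string {
    // One cache entry per remote URL and part ('html', 'css' or 'js').
    $key    = 'wpse_part_' . md5( $url . '|' . $part );
    $cached = get_transient( $key );
    if ( false !== $cached ) {
        return $cached;
    }

    // Cache miss: run the (assumed) fetch-and-parse callback and store its output.
    $output = $parse( $url, $part );
    set_transient( $key, $output, $ttl );
    return $output;
}
```

The one-hour default lifetime is arbitrary; pick a value that matches how often the remote file changes.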