I have a file, index.html, that contains several div tags. I am trying to fetch the content of all the div tags in the page, but I am getting the content of only the first one. I need the content of every div in the page.
Here is my code:
<?php
// Function to get the contents of an attribute of an HTML tag
function get_attribute_contents($element) {
$obj_attribute = array ();
foreach ( $element->attributes as $attribute ) {
$obj_attribute [$attribute->name] = $attribute->value;
}
return $obj_attribute;
}
// Function to get contents of a child element of an HTML tag
function get_child_contents($element) {
$obj_child = array ();
foreach ( $element->childNodes as $subElement ) {
if ($subElement->nodeType != XML_ELEMENT_NODE) {
if (trim ( $subElement->wholeText ) != "") {
$obj_child ["value"] = $subElement->wholeText;
}
} else {
if ($subElement->getAttribute ( 'id' )) {
$obj_child [$subElement->tagName . "#" . $subElement->getAttribute ( 'id' )] = get_tag_contents ( $subElement );
} else {
$obj_child [$subElement->tagName] = get_tag_contents ( $subElement );
}
}
}
return $obj_child;
}
// Function to get the contents of an HTML tag
function get_tag_contents($element) {
$obj_tag = array ();
if (get_attribute_contents ( $element )) {
$obj_tag ["attributes"] = get_attribute_contents ( $element );
}
if (get_child_contents ( $element )) {
$obj_tag ["child_nodes"] = get_child_contents ( $element );
}
return $obj_tag;
}
// Function to convert a DOM element to an object
function element_to_obj($element) {
$object = array ();
$tag = $element->tagName;
$object [$tag] = get_tag_contents ( $element );
return $object;
}
// Function to convert an HTML to a DOM element
function html_to_obj($html) {
$dom = new DOMDocument ();
$dom->loadHTML ( $html );
$docElement = $dom->documentElement;
return element_to_obj ( $dom->documentElement );
}
// Reading the contents of an HTML file
$html = file_get_contents ( 'index.html' );
header ( "Content-Type: text/plain" );
// Converting the HTML to JSON
$output = json_encode ( html_to_obj ( $html ) );
// Writing the JSON output to an external file
$file = fopen ( "js_output.json", "w" );
fwrite ( $file, $output );
fclose ( $file );
echo "HTML to JSON conversion has been completed.\n";
echo "Please refer to js_output.json to view the JSON output.";
?>
And the HTML file is:
<div class="issue-message">
Rename this package name to match the regular expression
'^[a-z]+(\.[a-z][a-z0-9]*)*$'.
<button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
<div class="issue-message">
Replace this use of System.out or System.err by a logger.
<button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
<div class="issue-message">
Replace this use of System.out or System.err by a logger.
<button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
<div class="issue-message">
Rename this package name to match the regular expression
'^[a-z]+(\.[a-z][a-z0-9]*)*$'.
<button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
<div class="issue-message">
Replace this use of System.out or System.err by a logger.
<button class="button-link issue-rule icon-ellipsis-h little-spacer-left" aria-label="Rule Details"></button>
</div>
When I run the code on this file, the JSON output contains the content of only one div tag:
{
"html": {
"child_nodes": {
"body": {
"child_nodes": {
"p": {
"child_nodes": {
"value": "Issues found:"
}
},
"div": {
"attributes": {
"class": "issue-message"
},
"child_nodes": {
"value": "This block of commented-out lines of code should be removed.",
"button": {
"attributes": {
"class": "button-link issue-rule icon-ellipsis-h little-spacer-left",
"aria-label": "Rule Details"
}
}
}
}
}
}
}
}
}
You see only one div because you are building an associative array keyed by tag name. The divs sit on the same tree level and share the key "div", so each one overwrites the previous one as you iterate over them.
Your code is a mess, and I think it is too much for something this simple. Here is my version of your code, which parses an HTML DOM element into an associative PHP array.
Note: to avoid overwriting same-named elements, I simply push the children into an indexed array and store the tag name as a field of each element.
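To make the overwriting concrete, here is a minimal sketch that is independent of the DOM. The tag list and the "node N" values are made up for illustration; only the difference between keying by tag name and appending to an indexed array matters.

```php
<?php
// Hypothetical child tag names, all on the same tree level.
$tags = ['p', 'div', 'div', 'div'];

// Keyed by tag name: every later "div" replaces the earlier one.
$keyed = [];
foreach ($tags as $i => $tag) {
    $keyed[$tag] = "node $i";
}

// Appended to an indexed array: every node is kept,
// and the tag name is stored as data instead of as a key.
$indexed = [];
foreach ($tags as $i => $tag) {
    $indexed[] = ['tagName' => $tag, 'value' => "node $i"];
}

echo json_encode($keyed), "\n";  // {"p":"node 0","div":"node 3"}
echo count($indexed), "\n";      // 4
```

The keyed array ends up with a single "div" entry, while the indexed array keeps all four nodes.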
A simple recursive approach (packed into a static class):
<?php
class DomToArray {
/* Method to get the contents of the attributes
* @param $element -> Object DomElement
* @return Array
*/
private static function get_attribute_contents($element) {
$obj_attribute = [];
if ($element->hasAttributes()) {
foreach ( $element->attributes as $attribute ) {
$obj_attribute [$attribute->name] = $attribute->value;
}
}
return $obj_attribute;
}
/* Recursive method to walk the DOM tree and Extract the metadata we need
* @param $element-> Object DomElement
* @param &$tree-> Array Element
* @param $text -> String || null
* @return Array
*/
private static function get_tag_contents($element, &$tree, $text = null) {
//The node representation in our json model
$tree = array(
"tagName" => ($element->nodeType === 1 ? $element->tagName : $element->nodeName),
"nodeType" => $element->nodeType,
"attributes" => self::get_attribute_contents($element),
"value" => $text,
"child_nodes" => []
);
// iterate over children and Recursively parse them:
if ($element->hasChildNodes()) {
foreach ($element->childNodes as $subElement) {
$text = null;
if ($subElement->nodeType === XML_TEXT_NODE) {
$text = trim(preg_replace('/\s+/', ' ', $subElement->textContent)); // Collapses whitespace runs, including \r and \n
if ($text === '') continue; // Skip whitespace-only text nodes (empty() would also drop the text "0").
}
self::get_tag_contents($subElement, $tree["child_nodes"][], $text);
}
}
}
/* Main Method to convert an HTML string to an Array of nested elements that represents the DOM tree.
* @param &$html -> String
* @return Array
*/
public static function html_to_obj(&$html) {
$dom = new DOMDocument ();
$dom->loadHTML($html);
$tree = [];
self::get_tag_contents($dom->documentElement, $tree);
return $tree;
}
}
Now consider this program and input:
$source = "
<div class=\"issue-message\">
Rename this package name to match the regular expression
'^[a-z]+(\.[a-z][a-z0-9]*)*$'.
<button class=\"button-link issue-rule icon-ellipsis-h little-spacer-left\" aria-label=\"Rule Details\"></button>
</div>
<div class=\"issue-message\">
Replace this use of System.out or System.err by a logger.
<button class=\"button-link issue-rule icon-ellipsis-h little-spacer-left\" aria-label=\"Rule Details\"></button>
</div>
";
$array_tree = DomToArray::html_to_obj($source);
echo json_encode($array_tree);
The output will be:
{
"tagName": "html",
"nodeType": 1,
"attributes": [],
"value": null,
"child_nodes": [
{
"tagName": "body",
"nodeType": 1,
"attributes": [],
"value": null,
"child_nodes": [
{
"tagName": "div",
"nodeType": 1,
"attributes": {
"class": "issue-message"
},
"value": null,
"child_nodes": [
{
"tagName": "#text",
"nodeType": 3,
"attributes": [],
"value": "Rename this package name to match the regular expression '^[a-z]+(\\.[a-z][a-z0-9]*)*$'.",
"child_nodes": []
},
{
"tagName": "button",
"nodeType": 1,
"attributes": {
"class": "button-link issue-rule icon-ellipsis-h little-spacer-left",
"aria-label": "Rule Details"
},
"value": null,
"child_nodes": []
}
]
},
{
"tagName": "div",
"nodeType": 1,
"attributes": {
"class": "issue-message"
},
"value": null,
"child_nodes": [
{
"tagName": "#text",
"nodeType": 3,
"attributes": [],
"value": "Replace this use of System.out or System.err by a logger.",
"child_nodes": []
},
{
"tagName": "button",
"nodeType": 1,
"attributes": {
"class": "button-link issue-rule icon-ellipsis-h little-spacer-left",
"aria-label": "Rule Details"
},
"value": null,
"child_nodes": []
}
]
}
]
}
]
}
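If all you need in the end is the text of each div, you can walk a tree of this shape and collect it. The following is only a sketch: collect_text is a hypothetical helper, not part of the class above, and the sample tree is a hand-trimmed stand-in for the real output that keeps only the "tagName", "value" and "child_nodes" fields.

```php
<?php
// Hypothetical helper: collects the direct text of every node with the
// given tag name from a tree shaped like the output above.
function collect_text(array $node, string $tagName, array &$found = []): array {
    if ($node['tagName'] === $tagName) {
        $parts = [];
        foreach ($node['child_nodes'] as $child) {
            // Text nodes are stored with the tag name "#text".
            if ($child['tagName'] === '#text') {
                $parts[] = $child['value'];
            }
        }
        $found[] = implode(' ', $parts);
    }
    // Recurse so that matching nodes at any depth are found.
    foreach ($node['child_nodes'] as $child) {
        collect_text($child, $tagName, $found);
    }
    return $found;
}

// Hand-trimmed sample tree (assumed shape, shortened values).
$tree = [
    'tagName' => 'body', 'value' => null, 'child_nodes' => [
        ['tagName' => 'div', 'value' => null, 'child_nodes' => [
            ['tagName' => '#text', 'value' => 'Rename this package name.', 'child_nodes' => []],
            ['tagName' => 'button', 'value' => null, 'child_nodes' => []],
        ]],
        ['tagName' => 'div', 'value' => null, 'child_nodes' => [
            ['tagName' => '#text', 'value' => 'Replace this use of System.out or System.err by a logger.', 'child_nodes' => []],
        ]],
    ],
];

$texts = collect_text($tree, 'div');
print_r($texts);
```

With the real data you would pass the array returned by DomToArray::html_to_obj instead of the sample tree.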
Hope I helped you.