I have to parse externally provided XML that has attributes with line breaks in them. Using SimpleXML, the line breaks seem to be lost. According to another stackoverflow question, line breaks should be valid (even though far less than ideal!) for XML.
Why are they lost? [edit] And how can I preserve them? [/edit]
Here is a demo file script (note that when the line breaks are not in an attribute they are preserved).
PHP File with embedded XML
$xml = <<<XML
<?xml version="1.0" encoding="utf-8"?>
<Rows>
<data Title='Data Title' Remarks='First line of the row.
Followed by the second line.
Even a third!' />
<data Title='Full Title' Remarks='None really'>First line of the row.
Followed by the second line.
Even a third!</data>
</Rows>
XML;
$xml = new SimpleXMLElement( $xml );
print '<pre>'; print_r($xml); print '</pre>';
Output from print_r
SimpleXMLElement Object
(
[data] => Array
(
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[Title] => Data Title
[Remarks] => First line of the row. Followed by the second line. Even a third!
)
)
[1] => First line of the row.
Followed by the second line.
Even a third!
)
)
The entity for a new line is
. I played with your code until I found something that did the trick. It's not very elegant, I warn you:
//First remove any indentations:
$xml = str_replace(" ","", $xml);
$xml = str_replace("\t","", $xml);
//Next replace unify all new-lines into unix LF:
$xml = str_replace("\r","\n", $xml);
$xml = str_replace("\n\n","\n", $xml);
//Next replace all new lines with the unicode:
$xml = str_replace("\n"," ", $xml);
Finally, replace any new line entities between >< with a new line:
$xml = str_replace("> <",">\n<", $xml);
The assumption, based on your example, is that any new lines that occur inside a node or attribute will have more text on the next line, not a <
to open a new element.
This of course would fail if your next line had some text that was wrapped in a line-level element.