Search code examples
pythonxmlbeautifulsoup

Include newline characters between nested XML Tags (BeautifulSoup)


I'd like to test that some (de)serialization code is 100% symmetrical but I've found that, when serializing my data as a nested Tag, the inner Tags don't have newline characters between them like the original Tag did.

Is there some way to force newline characters between nested Tags without having to prettify -> re-parse?

Example:

import bs4
from bs4 import BeautifulSoup

tag_string = "<OUTER>\n<INNER/>\n</OUTER>"
tag = BeautifulSoup(tag_string, "xml").find("OUTER")

print(tag)
... <OUTER>
... <INNER/>
... </OUTER>

new_tag = bs4.Tag(name="OUTER")
new_tag.append(bs4.Tag(name="INNER", can_be_empty_element=True))

tag == new_tag
... False

print(new_tag)
... <OUTER><INNER/></OUTER>

In order to get a 100% symmetrical serialization, I have to:

  • prettify the Tag into a string
  • replace "\n " with "\n"
  • parse this string as a new XML Tag
  • find the OUTER Tag
new_tag = BeautifulSoup(
    new_tag.prettify().replace("\n ", "\n"), "xml"
).find("OUTER")

tag == new_tag
... True

print(new_tag)
... <OUTER>
... <INNER/>
... </OUTER>

Justification for this need:

  • I'm deserializing thousands of tags into objects using some test data specific to me
  • as I developed this feature, I found that there were many mismatches between input / output for several of the attributes
  • mismatches are not acceptable because it causes the closed-source binary that reads this data to perform operations that should be avoided unless there's a genuine difference between that program's database and the entries in this XML
  • I was able to handle these mismatches for my test data, but I am not able to guarantee (de)serialization symmetry for all users of this code
  • as a result, I think it's best if the test suit enumerates tags for the users' data asserting that input == output and, to facilitate this, I've had to do this kinda annoying hackaround (prettify -> re-parse) which I'd like to avoid if possible

Solution

  • BeautifulSoup has a NavigableString element which can be used to represent an XML structure with newline characters in it without having to convert to a string and parse back into a bs4.element.Tag.

    Example:

    import bs4
    from bs4 import BeautifulSoup
    
    tag_string = "<OUTER>\n<INNER/>\n</OUTER>"
    tag = BeautifulSoup(tag_string, "xml").find("OUTER")
    
    new_tag = bs4.Tag(name="OUTER")
    new_tag.append(bs4.NavigableString("\n"))
    new_tag.append(bs4.Tag(name="INNER", can_be_empty_element=True))
    new_tag.append(bs4.NavigableString("\n"))
    
    tag == new_tag
    ... True