python behave generates XML that is "Not well-formed"

I am using python behave for about 40 tests that I run. Now I am trying to make a more-or-less decent looking HTML report for myself and my client.

I run the tests from the command line with behave --junit. Then I take the XML, parse it with ElementTree, and write an HTML file.

I have basically managed to do this, except that I have to edit the XML by hand because it contains some weird characters. It seems to me that those characters really shouldn't be there. Trying to ignore them with recover=True (as suggested in "ParseError: not well-formed (invalid token) using cElementTree", for instance) did not work either. Without that option the parser reports "not well-formed (invalid token)"; with it, the parser drops everything after the strange characters, resulting in a very short test report.

Is there something I am missing? Maybe something in the organisation or execution of my behave tests results in this broken XML?

Maybe just learning which characters they are would help, so that I can write code to replace or delete them before parsing.

Any help is appreciated!

Cheerz,

Chai

Here is a piece of the XML with those strange characters. I see that quoting it here already makes it show differently, so I added a screenshot from Sublime Text as well.

<testcase classname="screenshots.Features.Aanvraagformulier.Aanvraagformulier" name="Test 02 Veld validatie checken voor enkel veld zakelijke aanvraag" status="failed" time="79.278"><error message="Message: Time out bij t wachten op element met css of element niet gevonden: #pa..." type="NoSuchElementException">
<![CDATA[
Failing step: Given Dat ik ingelogd ben als aanvrager ... failed in 79.278s
Location: Features\Aanvraagformulier.feature:98
Traceback (most recent call last):
  File "c:\python27\lib\site-packagesehave\model.py", line 1456, in run
    match.run(runner.context)
  File "c:\python27\lib\site-packagesehave\model.py", line 1903, in run
    self.func(context, *args, **kwargs)
  File "D:\Chai_Testspul\PythonScripts\sigmaspul\Featureslgemeen\general_steps.py", line 57, in dat_ik_ingelogd_ben
    login(context, email, password)
  File "D:\Chai_Testspul\PythonScripts\sigmaspul\Featureslgemeen\page_commands.py", line 18, in login
    wait_for_css(context.driver, '#passwordInput')
  File "D:\Chai_Testspul\PythonScripts\sigmaspul\Featureslgemeen\page_commands.py", line 44, in wait_for_css
    raise NoSuchElementException('Time out bij t wachten op element met css of element niet gevonden: ' + css)
NoSuchElementException: Message: Time out bij t wachten op element met css of element niet gevonden: #passwordInput

]]>
</error>

Solution

This looks like a bug somewhere. Notice that BS appears in your output where you'd expect \b, and BEL appears where you'd expect \a. The issue is that backslash + letter combinations are interpreted as control sequences wherever possible.

Here's an interactive Python session that illustrates what happens:

>>> print "a\bc\qd"
c\qd

\b is interpreted as a backspace and thus c overwrites a. (Your editor shows the character as BS instead of applying it.) \q passes as-is because \q does not form a meaningful control sequence.

Now, see this:

>>> print r"a\bc\qd"
a\bc\qd

If you use a raw string, r"...", the backslashes are kept literally and everything passes through.
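To learn which characters actually ended up in a string, you can list the code points of anything below U+0020 other than tab, line feed and carriage return. This is only a small sketch; find_control_chars is a made-up helper name, not something Behave provides.

```python
def find_control_chars(text):
    """Return (position, code point) pairs for control characters in text.

    Tab, line feed and carriage return are skipped because XML allows them.
    """
    return [
        (index, ord(char))
        for index, char in enumerate(text)
        if ord(char) < 0x20 and char not in "\t\n\r"
    ]


plain = "packages\behave"   # \b collapses into a single backspace (code 8)
raw = r"packages\behave"    # raw string: backslash and 'b' stay two characters

print(find_control_chars(plain))  # [(8, 8)]
print(find_control_chars(raw))    # []
```

Running the same helper over the contents of the XML file should show a code 8 where a \b was swallowed and a code 7 where an \a was.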

You could work around it by replacing each of these control characters with the backslash sequence it came from. Then the XML would be well-formed again.
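A rough sketch of that workaround, under a few assumptions: the report file is UTF-8, the names repair_control_chars and parse_report are invented for this example, and a control character is mapped back to the escape it most likely came from. Tab, line feed and carriage return are left alone because XML allows them, which also means a path segment such as \t or \n that was already turned into whitespace cannot be recovered this way.

```python
import io
import re
import xml.etree.ElementTree as ET

# Assumed mapping from a control character back to the backslash
# sequence that probably produced it.
ESCAPE_FOR = {
    "\x07": r"\a",  # BEL
    "\x08": r"\b",  # BS
    "\x0b": r"\v",  # vertical tab
    "\x0c": r"\f",  # form feed
}

# Characters below U+0020 that XML 1.0 forbids (tab, LF and CR are legal).
INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def repair_control_chars(text):
    """Swap known control characters for their escapes and drop the rest."""
    return INVALID_XML_CHARS.sub(
        lambda match: ESCAPE_FOR.get(match.group(0), ""), text
    )


def parse_report(path):
    """Read a JUnit XML file, repair it, and return the root element."""
    with io.open(path, encoding="utf-8") as handle:
        cleaned = repair_control_chars(handle.read())
    # Parse from bytes so an encoding declaration in the file is honoured.
    return ET.fromstring(cleaned.encode("utf-8"))
```

With this in place, parse_report would be called on each file that behave --junit wrote, instead of parsing the file directly.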

Ultimately, though, the bug should be fixed at its source. Maybe a library that Behave depends on is buggy, or something you use to process Behave's output, or Behave itself.