Tags: python, xml, html-entities, python-2.5

How Can I Output Non-Escaped Element Tag In XML?


I inherited a Python script in which a paragraph variable holds a chunk of text containing anchor tags. For example:

This is text with a <a href="http://somewebsite.com">Link</a> in it.

What I need to do, however, is move the anchor tags into the apxh namespace, so the line above should look something like this:

This is text with a <apxh:a href="http://somewebsite.com">Link</apxh:a> in it.

The problem is that my current approach outputs:

This is text with a &lt;apxh:a href=\"http://somewebsite.com;\"&gt;Link Text;&lt;/apxh:a&gt; in it.

My guess is that when I loop over the paragraphs, I need to somehow find all anchor tags and their text and do something like etree.Element("{%s}a" % nm["apxh"], nsmap=nm), but I'm not really sure.

This is the current script:

def get_news_feed(request):
    articles = models.Article.objects.all().filter(distributable = True)

    nm = {
            None: "http://www.w3.org/2005/Atom",
            "ap": "http://ap.org/schemas/03/2005/aptypes",
            "apcm": "http://ap.org/schemas/03/2005/apcm",
            "apnm": "http://ap.org/schemas/03/2005/apnm",
            "apxh": "http://www.w3.org/1999/xhtml",
            }

    doc = etree.Element("{%s}feed" % nm[None], nsmap=nm)

    for article in articles:
        entry = etree.Element("{%s}entry" % nm[None], nsmap=nm)
        content = etree.Element("{%s}content" % nm[None], nsmap=nm)
        content.set("type", "xhtml")

        div = etree.Element("{%s}div" % nm["apxh"], nsmap=nm)
        for paragraph in article.body.replace("&amp;", "&").split("\n"):
            par = etree.Element("{%s}p" % nm["apxh"], nsmap=nm)
            par.text = paragraph            
            par.text = paragraph.replace("<a", "<apxh:a")            
            par.text = par.text.replace("</a", "</apxh:a")  
            par.text = cleanup_entities(par.text)
            div.append(par)
        content.append(div)
        entry.append(content)

        doc.append(entry)

    output = etree.tostring(doc, encoding="UTF-8", xml_declaration=True, pretty_print=True)
    return HttpResponse(output, mimetype="application/xhtml+xml")

This is how the output should look:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:ap="http://ap.org/schemas/03/2005/aptypes" xmlns:apxh="http://www.w3.org/1999/xhtml" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <apxh:div>
        <apxh:p>This is some text</apxh:p>
        <apxh:p>This is text with a <apxh:a href="http://somewebsite.com">Link</apxh:a> in it.</apxh:p>
        <apxh:p>Theater</apxh:p>
      </apxh:div>
    </content>
  </entry>
</feed>

This is how the output currently looks:

<?xml version='1.0' encoding='UTF-8'?>
<feed xmlns:ap="http://ap.org/schemas/03/2005/aptypes" xmlns:apxh="http://www.w3.org/1999/xhtml" xmlns:apnm="http://ap.org/schemas/03/2005/apnm" xmlns:apcm="http://ap.org/schemas/03/2005/apcm" xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml">
      <apxh:div>
        <apxh:p>This is some text</apxh:p>
        <apxh:p>This is text with a &lt;apxh:a href=\"http://somewebsite.com;\"&gt;Link Text;&lt;/apxh:a&gt; in it.</apxh:p>
        <apxh:p>Theater</apxh:p>
      </apxh:div>
    </content>
  </entry>
</feed>

Solution

  • Don't inject your content as literal text. Parse it into DOM content instead, wrapped in a root element whose default namespace is the same one that your nsmap maps to apxh:

    import lxml.etree as etree
    text = 'This is text with a <a href="http://somewebsite.com">Link</a> in it.'
    text_el = etree.fromstring('<root xmlns="http://www.w3.org/1999/xhtml">' + text + '</root>')
    

    ...then put the contents of text_el inside your par.

    Doing that might look like the following:

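    # Build the paragraph element; nm is the namespace map from the question.
    par = etree.Element('{http://www.w3.org/1999/xhtml}p', nsmap=nm)
    # Leading text that comes before the first child element.
    par.text = text_el.text
    # Move each child over; its trailing (tail) text travels with it.
    for child_el in text_el[:]:
        par.append(child_el)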
    

    Because par is created with the nsmap nm, serializing it back to a string honors the prefixes in that map, so elements parsed in the default namespace (set with xmlns= on the artificial root) come out with the apxh prefix.
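
Putting those pieces together, here is a minimal end-to-end sketch. The helper name paragraph_to_element and the trimmed-down nm are assumptions made for illustration; only etree.fromstring, etree.Element and etree.tostring come from lxml itself.

```python
import lxml.etree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed copy of the question's namespace map (assumption: the other
# prefixes are not needed to show the effect).
nm = {
    None: "http://www.w3.org/2005/Atom",
    "apxh": XHTML_NS,
}


def paragraph_to_element(text, nsmap):
    """Parse an XHTML fragment and return it inside a namespaced <p>.

    Hypothetical helper for illustration; not part of lxml.
    """
    # The artificial root puts un-prefixed tags such as <a> into XHTML_NS.
    wrapper = etree.fromstring(
        '<root xmlns="%s">%s</root>' % (XHTML_NS, text))
    par = etree.Element("{%s}p" % XHTML_NS, nsmap=nsmap)
    par.text = wrapper.text        # text before the first child element
    for child in list(wrapper):    # iterate over a copy: append() moves nodes
        par.append(child)          # the child's tail text moves with it
    return par


par = paragraph_to_element(
    'This is text with a <a href="http://somewebsite.com">Link</a> in it.',
    nm)
serialized = etree.tostring(par).decode("utf-8")
print(serialized)
```

In the question's loop, a helper like this would take the place of the par.text = ... assignments, with div.append(par) left unchanged.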


    In the comments, it came up that some of your production data looks like:

    u'John Doe: 360-555-4546; <a href=\\"mailto:[email protected];\\">John.mailto:[email protected]</a> twitter.com/JohnDoe'
    

    etree.fromstring() will raise an exception on this input: the backslashes before the quotes make it invalid XML (and invalid XHTML).

    If you're quite sure that \" won't ever occur in valid input, you might consider:

    text_el = etree.fromstring(
      '<root xmlns="http://www.w3.org/1999/xhtml">' +
      text.replace('\\"', '"') +
      '</root>')
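
If other malformed input can show up as well, one possible way to guard the parse is sketched below. The helper fill_paragraph, the sample strings and the fall-back-to-escaped-text policy are all assumptions for illustration, not something the original script does; etree.XMLSyntaxError is the exception lxml raises for unparseable input.

```python
import lxml.etree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"


def fill_paragraph(par, text):
    """Fill par from an XHTML fragment; return True if it parsed as markup.

    Hypothetical helper for illustration; not part of lxml.
    """
    # Assumption: a backslash followed by a quote never occurs in valid input.
    cleaned = text.replace('\\"', '"')
    try:
        wrapper = etree.fromstring(
            '<root xmlns="%s">%s</root>' % (XHTML_NS, cleaned))
    except etree.XMLSyntaxError:
        # Still not well-formed: keep the original string as literal text,
        # which the serializer will escape.
        par.text = text
        return False
    par.text = wrapper.text
    for child in list(wrapper):
        par.append(child)
    return True


# Made-up sample with backslash-escaped quotes, similar in shape to the
# production data above.
par = etree.Element("{%s}p" % XHTML_NS, nsmap={"apxh": XHTML_NS})
parsed = fill_paragraph(
    par, 'Contact: <a href=\\"mailto:someone@example.com\\">Mail</a> now')

# Made-up sample that stays invalid even after the cleanup.
bad = etree.Element("{%s}p" % XHTML_NS, nsmap={"apxh": XHTML_NS})
bad_parsed = fill_paragraph(bad, "AT&T <b>unclosed")
```

Whether falling back to escaped text is acceptable depends on what the consumers of the feed expect; logging the offending article instead may be the better choice.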