I have a TEI document containing character entities such as &stern_1;,
which are defined in a separate DTD (Document Type Definition) file,
Zeichen.dtd. That file contains:
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY stern_1 "✳" >
I am using BeautifulSoup4 with lxml-xml as the parser.
Example:
dtd_str = '<!DOCTYPE Zeichen SYSTEM "Zeichen.dtd">'
xml_str = "<p>Hello, &stern_1;!</p>"
from bs4 import BeautifulSoup
soup = BeautifulSoup(dtd_str+xml_str, 'lxml-xml')
print(soup.find('p').get_text())
The code above prints this:
Hello, !
instead of this:
Hello, ✳!
I also tried an inline DTD, with the same result:
dtd_str = """
<!DOCTYPE html [
<!ENTITY stern_1 "✳">
]>
"""
xml_str = "<p>Hello, &stern_1;!</p>"
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_str, 'lxml-xml')
print(soup.find('p').get_text())
Output:
Hello, !
Any ideas?
I finally found a working solution to my own problem:
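from bs4 import BeautifulSoup
from lxml import etree

dtd_str = """
<!DOCTYPE html [
<!ENTITY stern_1 "✳">
]>
"""
xml_str = "<p>Hello, &stern_1;!</p>"

# Let lxml parse the document so the entity declared in the DTD is expanded.
tree = etree.fromstring(dtd_str + xml_str)
# Serialize the resolved tree and hand it to BeautifulSoup for navigation.
soup = BeautifulSoup(etree.tostring(tree, encoding='unicode'), "lxml-xml")
print(soup.find('p').get_text())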
This prints:
Hello, ✳!
which is exactly what I wanted. When used directly, lxml resolves the entities declared in the DTD, whereas BeautifulSoup has a much nicer and more intuitive API for walking the tree.
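The snippet above uses an inline DTD, while my real documents point at the external Zeichen.dtd. Below is a minimal sketch of how that case might be handled with lxml alone. It assumes the DTD sits in the same directory as the document and that passing load_dtd=True together with a base_url is enough for lxml to locate it. The helper name parse_with_external_dtd, the temporary directory and the file name document.xml are made up for illustration.

```python
import os
import tempfile

from lxml import etree

# Stand-in for the real TEI file; only the DOCTYPE and the entity matter here.
XML_TEXT = '<!DOCTYPE Zeichen SYSTEM "Zeichen.dtd"><p>Hello, &stern_1;!</p>'


def parse_with_external_dtd(xml_text, base_dir):
    """Parse xml_text, loading the DTD named in its DOCTYPE from base_dir."""
    # load_dtd makes lxml read the external subset; resolve_entities expands
    # the references it declares.
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    # The relative system identifier "Zeichen.dtd" is resolved against this
    # (hypothetical) document location.
    base_url = os.path.join(base_dir, "document.xml")
    return etree.fromstring(xml_text, parser, base_url=base_url)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate Zeichen.dtd with the content shown in the question.
    with open(os.path.join(tmp, "Zeichen.dtd"), "w", encoding="utf-8") as fh:
        fh.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        fh.write('<!ENTITY stern_1 "✳" >\n')
    root = parse_with_external_dtd(XML_TEXT, tmp)
    text = root.text
    print(text)
```

The resolved tree can then be serialized with etree.tostring(root, encoding='unicode') and passed to BeautifulSoup exactly as in the inline-DTD version.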