I have a text file that contains some XML data and some HTML data. Both start with "<". I want to extract only the XML data and save it in another file. How can I do that?
File example:
xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data
Note: This file is in .txt format.
I would treat your whole input not as XML, but as an HTML fragment. HTML can contain non-standard elements, so <note> etc. is fine.
For convenience I suggest pyquery to deal with HTML. It works pretty much the same way as jQuery, so if you've worked with that before, it should be familiar.
It's pretty straightforward: load your data, wrap it in "<html></html>", parse it, and query it.
from pyquery import PyQuery as pq
data = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data"""
doc = pq(f"<html><body>{data}</body></html>")
note = doc.find("note")
print(note.find("body").text())
This prints "Don't forget me this weekend!".
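Since the question also asks about saving the XML part to another file, here is a rough sketch of one way that could look. Note that it swaps pyquery for the standard library's xml.etree.ElementTree, and it only works under the assumption that every tag in the text is properly closed and nested, so the whole thing parses once it is wrapped in a dummy root element. The helper name extract_element and the output file name note.xml are made up for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def extract_element(text, tag):
    """Return the first <tag> element in text serialized as XML, or None."""
    # Wrap the mixed text in a dummy root so it parses as one document.
    # Assumption: all tags in the text are closed and correctly nested.
    root = ET.fromstring(f"<root>{text}</root>")
    element = root.find(f".//{tag}")
    if element is None:
        return None
    # ElementTree stores the text after a closing tag in .tail and would
    # serialize it too, so drop it to keep only the element itself.
    element.tail = None
    return ET.tostring(element, encoding="unicode")


sample = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
xyz data
<bold>xyz</bold>
text
text
text
<bold>xyz</bold>
again XML data"""

xml_text = extract_element(sample, "note")

# Hypothetical output file name; change it to whatever you need.
Path("note.xml").write_text(xml_text, encoding="utf-8")
```

For a real file you would read the text with Path("yourfile.txt").read_text() instead of the inline sample. If the file contains sloppy HTML (unclosed tags, bare "&"), this strict parse will raise an error, and a lenient HTML parser such as the one pyquery uses is the safer choice.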