
Get only XML data from a text file using Python


I have a text file that contains some XML data and some HTML data. Both start with "<". I want to extract only the XML data and save it to another file. How can I do it?

File example:

xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

xyz data
<bold>xyz</bold>

text 
text 
text

<bold>xyz</bold>

again XML data

Note: This file is in .txt format.


Solution

  • I would treat your whole input not as XML, but as an HTML fragment. HTML can contain non-standard elements, so <note> and similar tags are fine.

    For convenience I suggest pyquery to deal with HTML. It works much like jQuery, so if you've worked with that before, it should be familiar.

    It's pretty straightforward: load your data, wrap it in "<html><body>" and "</body></html>", parse it, query it.

    from pyquery import PyQuery as pq
    
    data = """xyz data:
    <note>
    <to>john</to>
    <from>doe</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
    </note>
    
    xyz data
    <bold>xyz</bold>
    
    text 
    text 
    text
    
    <bold>xyz</bold>
    
    again XML data"""
    
    doc = pq(f"<html><body>{data}</body></html>")
    note = doc.find("note")
    
    print(note.find("body").text())
    

    which prints "Don't forget me this weekend!".
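
To also save the extracted XML to another file, as the question asks, one possible approach is sketched below. It swaps pyquery for the standard library's re and xml.etree.ElementTree, and it assumes the XML root tag name (note here) is known in advance. The function name extract_xml_blocks and the output file name extracted.xml are hypothetical, chosen only for illustration.

```python
import re
import xml.etree.ElementTree as ET


def extract_xml_blocks(text, root_tag):
    """Return every well-formed <root_tag>...</root_tag> block found in text."""
    tag = re.escape(root_tag)
    # Non-greedy match from an opening root tag to the nearest closing one.
    pattern = re.compile(rf"<{tag}\b[^>]*>.*?</{tag}>", re.DOTALL)
    blocks = []
    for match in pattern.finditer(text):
        candidate = match.group(0)
        try:
            ET.fromstring(candidate)  # keep only candidates that parse as XML
        except ET.ParseError:
            continue
        blocks.append(candidate)
    return blocks


# Sample input mirroring the file from the question.
sample = """xyz data:
<note>
<to>john</to>
<from>doe</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

xyz data
<bold>xyz</bold>

again XML data"""

blocks = extract_xml_blocks(sample, "note")

# Hypothetical output path: write the extracted XML to a separate file.
with open("extracted.xml", "w", encoding="utf-8") as out:
    out.write("\n".join(blocks))
```

Because the match is non-greedy, this sketch would cut a block short if the root tag were nested inside itself; for input like that, a real parser such as the pyquery approach above is the safer choice.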