python xml dom minidom

How can I parse a URL's content when I get the error "xml.parsers.expat.ExpatError: mismatched tag"?


I want to extract all links inside the DOCUMENT elements of this webpage:

target to be parsed

import urllib.request
from xml.dom import minidom

url = 'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/0001326801-20-000013-index-headers.html'
# Download the raw page bytes
ob = urllib.request.urlopen(url).read()
# Try to parse the page as XML (this is the call that fails)
xmldoc = minidom.parseString(ob)

It fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python3.5/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/usr/lib/python3.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: mismatched tag: line 876, column 23

Maybe it is a badly formatted XML file. How can I load it with minidom?


Solution

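  • I have no idea what this file is, but it's not XML, so it can't be parsed using an XML parser. The "mismatched tag" error means the parser reached a closing tag that does not match the most recently opened element, which well-formed XML never allows.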
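
If the goal is only to collect links, one possible workaround is to skip XML parsing and use a tolerant parser instead. The following is a minimal sketch using the standard library's html.parser, which does not require tags to be balanced. It assumes the links appear as ordinary <a href="..."> tags; the names LinkCollector and extract_links, and the sample string, are made up for illustration and are not taken from the actual SEC page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, tolerating unclosed or mismatched tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return every href found in the given markup string."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Hypothetical sample: <TYPE> is never closed, so an XML parser would reject it
sample = '<DOCUMENT>\n<TYPE>10-K\n<a href="doc1.htm">doc1.htm</a>\n</DOCUMENT>'
print(extract_links(sample))  # ['doc1.htm']
```

To try this on the downloaded page, the bytes would first need decoding, for example extract_links(ob.decode("utf-8", errors="replace")). Whether the real page exposes its links this way has not been verified here, and restricting the result to links inside DOCUMENT elements would need extra state tracking in the handler.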