I want to extract all links inside the DOCUMENT elements
of this web page:
import urllib.request
from xml.dom import minidom

url = 'https://www.sec.gov/Archives/edgar/data/1326801/000132680120000013/0001326801-20-000013-index-headers.html'
ob = urllib.request.urlopen(url).read()
xmldoc = minidom.parseString(ob)
It fails with the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/xml/dom/minidom.py", line 1968, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python3.5/xml/dom/expatbuilder.py", line 925, in parseString
    return builder.parseString(string)
  File "/usr/lib/python3.5/xml/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: mismatched tag: line 876, column 23
Maybe it is a badly formatted XML file. How can I load it with minidom?
I have no idea what this file is, but it's not XML, and it can't be parsed using an XML parser.
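
If the goal is only to pull out the links, a tolerant HTML parser may be enough, since it does not require matching tags. Below is a minimal sketch using the standard library's html.parser. It rests on assumptions I have not verified against the real page: that the links are ordinary <a href="..."> tags, and that the DOCUMENT markers appear either as real tags or as escaped text (&lt;DOCUMENT&gt;). The names DocumentLinkCollector and extract_document_links, and the sample string, are made up for illustration.

```python
from html.parser import HTMLParser


class DocumentLinkCollector(HTMLParser):
    """Collect href values of <a> tags that appear inside DOCUMENT sections."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_document = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DOCUMENT> arrives as "document".
        if tag == "document":
            self.in_document = True
        elif tag == "a" and self.in_document:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "document":
            self.in_document = False

    def handle_data(self, data):
        # Assumption: markers written as &lt;DOCUMENT&gt; show up here as
        # decoded text. The last marker in the chunk decides the state.
        opened = data.rfind("<DOCUMENT>")
        closed = data.rfind("</DOCUMENT>")
        if opened > closed:
            self.in_document = True
        elif closed > opened:
            self.in_document = False


def extract_document_links(html_text):
    """Return the hrefs found inside DOCUMENT sections of html_text."""
    parser = DocumentLinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical sample, not copied from the real page.
sample = (
    '<html><body><a href="/outside.htm">outside</a><pre>'
    '&lt;DOCUMENT&gt;\n&lt;TYPE&gt;10-K\n'
    '&lt;FILENAME&gt;<a href="main.htm">main.htm</a>\n'
    '&lt;/DOCUMENT&gt;\n'
    '&lt;DOCUMENT&gt;\n&lt;FILENAME&gt;<a href="ex.htm">ex.htm</a>\n'
    '&lt;/DOCUMENT&gt;\n</pre></body></html>'
)
print(extract_document_links(sample))  # ['main.htm', 'ex.htm']
```

To try it on the page from the question, decode the bytes returned by urlopen(url).read() to a string first and pass that to extract_document_links. If the real markup differs from what is assumed here, the handlers would need adjusting.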