I am trying to retrieve the content from a .gz file that contains a collection of HTML documents; it's a file from the GOV2 collection. Each page is delimited by a <DOC> tag, and each <DOC> contains several metadata fields, among them the id of the document (in the <DOCNO> tag) and, inside the <html> tags, its content. Here is an example of such a file:
<doc>
<docno>GX000-xx-xxxxxxx</docno>
<dochdr>
<!-- no relevant meta info -->
</dochdr>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
<html>
<!-- the content I want to extract -->
</html>
</doc>
<doc>
<docno>GX000-xx-xxxxxxy</docno>
<dochdr>
<!-- no relevant meta info -->
</dochdr>
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 3.0//EN">
<html>
<!-- another content I want to extract -->
</html>
</doc>
I need two separate lists: one containing each docno and one containing the content of each html tag. Here is what I've done using BeautifulSoup:
import gzip
import re
from bs4 import BeautifulSoup

doc_file = 'xx.gz'
f = gzip.open(data_dir + doc_file)  # data_dir is the directory holding the collection files
doc_string = f.read()
f.close()
soup = BeautifulSoup(doc_string, "html.parser")
doc_list = soup.select('DOC')
doc_no = []
doc_content = []
for doc in doc_list:
    doc_no.append(doc.find('docno').get_text())
    doc_raw = doc.find('html')
    if doc_raw is None:  # it's possible a doc has no html tag
        doc_content.append('<null/>')
    else:
        # collapse runs of blank lines in the extracted text
        doc_content.append(re.sub(r'(\n\s*)+\n+', '\n', doc_raw.get_text()))
This works, but html.parser is a very slow parser (about 4 minutes per file, and I have several thousand files to scrape...). Thankfully, it's almost instant with another parser like lxml. However, that parser, for whatever reason, removes the <html> tags. I've tried an alternate way where I replace these tags in doc_string (using doc_string = doc_string.replace(b'<html>', b'<2html>')) before calling BeautifulSoup (a sketch of this attempt is below), but:
the process is very slow
for whatever reason the < characters are transformed into &lt;, and to unescape them I found no easier way than decoding doc_string, unescaping it, then re-encoding it, which is ridiculous time-wise. Even replacing b'html' directly with b'2html' seems to escape the < and >.
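For reference, the attempt looked roughly like this (a minimal sketch reusing data_dir and doc_file from the snippet above; using html.unescape is just one way to do the unescaping step):

import gzip
import html
from bs4 import BeautifulSoup

f = gzip.open(data_dir + doc_file)
doc_string = f.read()
f.close()
# rename the <html> tags so that lxml does not strip them
doc_string = doc_string.replace(b'<html>', b'<2html>').replace(b'</html>', b'</2html>')
# for whatever reason the < characters end up escaped as &lt;, so the data
# has to be decoded, unescaped and re-encoded before parsing, which is slow
doc_string = html.unescape(doc_string.decode(errors='ignore')).encode()
soup = BeautifulSoup(doc_string, "lxml")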
Do you have a faster way to do such a task?
Thank you for your help.
As said in my post, I thought that converting the document to a string, replacing the html tags and then re-encoding the string to bytes would take too long. Turns out I was wrong. The strategy I used after figuring this out is to replace EVERY occurrence of html (not only the tags) with another, unique word (htmltag in the code below). Then, once I have scraped the content of an htmltag element, I replace each remaining occurrence of htmltag back to html. That way the content is not altered at all.
import gzip
import re
from bs4 import BeautifulSoup

f = gzip.open(data_dir + doc_file)
doc_string = f.read()
f.close()
# decode, replace every occurrence of 'html' with a unique placeholder, then re-encode
doc_string_str = doc_string.decode(errors='ignore')
doc_string_str = doc_string_str.replace('html', 'htmltag')
doc_string = doc_string_str.encode()
soup = BeautifulSoup(doc_string, "lxml")
doc_list = soup.select('DOC')
doc_no = []
doc_content = []
for doc in doc_list:
    doc_no.append(doc.find('docno').get_text())
    doc_raw = doc.find('htmltag')
    if doc_raw is None:  # it's possible a doc has no html tag
        doc_content.append('<null/>')
    else:
        # collapse blank lines, then turn the htmltag placeholder back into html
        doc_content.append(re.sub(r'(\n\s*)+\n+', '\n', doc_raw.get_text()).replace('htmltag', 'html'))
Thank you to @shellter and @JL_Peyret for the help; I basically followed what you told me, but directly in Python. It now takes about 15 seconds per file.
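For completeness, here is roughly how I run this over the whole collection (the parse_gov2_file name and the *.gz glob pattern are just illustrative, and data_dir is the same directory variable as above):

import glob
import gzip
import re
from bs4 import BeautifulSoup

def parse_gov2_file(path):
    # same logic as above, wrapped in a function for a single .gz file
    with gzip.open(path) as f:
        doc_string_str = f.read().decode(errors='ignore')
    doc_string = doc_string_str.replace('html', 'htmltag').encode()
    soup = BeautifulSoup(doc_string, "lxml")
    doc_no, doc_content = [], []
    for doc in soup.select('DOC'):
        doc_no.append(doc.find('docno').get_text())
        doc_raw = doc.find('htmltag')
        if doc_raw is None:
            doc_content.append('<null/>')
        else:
            doc_content.append(re.sub(r'(\n\s*)+\n+', '\n', doc_raw.get_text()).replace('htmltag', 'html'))
    return doc_no, doc_content

for path in glob.glob(data_dir + '*.gz'):
    doc_no, doc_content = parse_gov2_file(path)
    # store or index doc_no / doc_content here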