I have crawled a lot of HTML pages (with similar content) from many sites using Scrapy, but their DOM structures are different.
For example, one of the sites use the following structure:
<div class="post">
<section class='content'>
Content1
</section>
<section class="panel">
</section>
</div>
<div class="post">
<section class='content'>
Content2
</section>
<section class="panel">
</section>
</div>
And I want to extract the data Content1 and Content2.
While another site may use structure like this:
<article class="entry">
<section class='title'>
Content3
</section>
</article>
<article class="entry">
<section class='title'>
Content4
</section>
</article>
And I want to extract the data Content3 and Content4.
The easiest solution would be to mark the XPath of the required data one by one for every site, but that would be a tedious job.
So I wonder if the structure can be extracted automatically. In fact, I just need to locate the repeated root node (div.post and article.entry in the example above); then I can extract the data with certain rules.
Is this possible?
BTW, I am not exactly sure what this kind of algorithm is called, so the tags of this post may be wrong; feel free to modify them if so.
You have to know at least some common patterns to be able to formulate deterministic extraction rules. The solution below is very primitive and by no means optimal, but it might help you:
# -*- coding: utf-8 -*-
import re

import bs4
from bs4 import element
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Heuristics: how often a path must repeat, how deep it must be,
        # and what it has to look like.
        min_occurs = 5
        max_occurs = 1000
        min_depth = 7
        max_depth = 7
        pattern = re.compile('^/html/body/.*/(span|div)$')
        extract_content = lambda e: e.css('::text').extract_first()
        #extract_content = lambda e: ' '.join(e.css('*::text').extract())

        # Count how many times each element path occurs in the document.
        doc = bs4.BeautifulSoup(response.body, 'html.parser')
        paths = {}
        self._walk(doc, '', paths)

        # Keep only the paths that satisfy the rules above.
        paths = self._filter(paths, pattern, min_depth, max_depth,
                             min_occurs, max_occurs)

        # Extract content from every element matching one of the kept paths.
        for path in paths.keys():
            for e in response.xpath(path):
                yield {'content': extract_content(e)}

    def _walk(self, doc, parent, paths):
        # Recursively visit every tag and count occurrences of its path.
        for tag in doc.children:
            if isinstance(tag, element.Tag):
                path = parent + '/' + tag.name
                paths[path] = paths.get(path, 0) + 1
                self._walk(tag, path, paths)

    def _filter(self, paths, pattern, min_depth, max_depth, min_occurs, max_occurs):
        # Keep only paths that match the pattern, fall inside the depth range
        # and occur an acceptable number of times.
        return dict((path, count) for path, count in paths.items()
                    if pattern.match(path) and
                    min_depth <= path.count('/') <= max_depth and
                    min_occurs <= count <= max_occurs)
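Assuming you save the spider as example.py (the filename is just an assumption), you can try it out with a plain Scrapy invocation such as:

    scrapy runspider example.py -o items.json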
It works like this:
To build the dictionary of paths, I just walk through the document using BeautifulSoup and count the occurrences of each element path. These counts are later used in the filtering step to keep only the most repeated paths.
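As a rough, standalone sketch (reusing the same walking logic on the first HTML snippet from the question, outside of Scrapy), the paths dictionary ends up mapping element paths to their occurrence counts:

import bs4
from bs4 import element

html = """
<div class="post"><section class="content">Content1</section><section class="panel"></section></div>
<div class="post"><section class="content">Content2</section><section class="panel"></section></div>
"""

def walk(node, parent, paths):
    # Same idea as _walk above: count how often each tag path occurs.
    for tag in node.children:
        if isinstance(tag, element.Tag):
            path = parent + '/' + tag.name
            paths[path] = paths.get(path, 0) + 1
            walk(tag, path, paths)

paths = {}
walk(bs4.BeautifulSoup(html, 'html.parser'), '', paths)
print(paths)  # {'/div': 2, '/div/section': 4}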
Next I filter out the paths based on some basic rules. For a path to be kept, it has to:
- occur at least min_occurs and at most max_occurs times;
- have a depth of at least min_depth and at most max_depth (measured as the number of elements in the path);
- match the pattern.
Other rules can be added in a similar fashion, as in the sketch below.
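For example, with the thresholds from the spider above (the path counts here are made up, purely for illustration), only the path that repeats often enough, has the right depth and matches the pattern survives:

import re

paths = {
    '/html/body/div': 1,                    # too shallow and too rare
    '/html/body/div/div/div/div/span': 12,  # satisfies every rule
    '/html/body/div/div/div/div/a': 12,     # does not match the pattern
}
pattern = re.compile('^/html/body/.*/(span|div)$')
min_depth, max_depth, min_occurs, max_occurs = 7, 7, 5, 1000

kept = dict((path, count) for path, count in paths.items()
            if pattern.match(path) and
            min_depth <= path.count('/') <= max_depth and
            min_occurs <= count <= max_occurs)
print(kept)  # {'/html/body/div/div/div/div/span': 12}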
The last part loops through the paths that are left after filtering and extracts content from the matching elements, using some common logic defined in extract_content.
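If the first text node is not enough, you can swap in a different extract_content lambda; the two variants below are just examples using the standard Scrapy selector API, not part of the original spider:

# Join all descendant text nodes instead of taking only the first one.
extract_content = lambda e: ' '.join(e.css('*::text').extract())

# Or extract an attribute of the matched element, e.g. its href.
extract_content = lambda e: e.xpath('@href').extract_first()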
If your web pages are rather simple and you can infer such rules, it might work. Otherwise, you would have to look at some kind of machine learning task, I guess.