I have crawled a lot of HTML pages (with similar content) from many sites using Scrapy, but their DOM structures are different.
For example, one of the sites use the following structure:
<div class="post">
<section class='content'>
Content1
</section>
<section class="panel">
</section>
</div>
<div class="post">
<section class='content'>
Content2
</section>
<section class="panel">
</section>
</div>
And I want to extract the data Content1 and Content2.
While another site may use structure like this:
<article class="entry">
<section class='title'>
Content3
</section>
</article>
<article class="entry">
<section class='title'>
Content4
</section>
</article>
And I want to extract the data Content3 and Content4.
The easiest solution would be to mark the XPath of the required data one by one for every site, but that would be a tedious job.
So I wonder if the structure can be extracted automatically. In fact, I just need to locate the repeated root node (div.post and article.entry in the example above); then I can extract the data with certain rules.
Is this possible?
BTW, I am not exactly sure what this kind of algorithm is called, so the tags of this post may be wrong; feel free to modify them if so.
You have to know at least some common patterns to be able to formulate deterministic extraction rules. The solution below is very primitive and by no means optimal, but it might help you:
# -*- coding: utf-8 -*-
import re

import bs4
from bs4 import element
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Heuristics: how often a path must repeat, how deep it must be,
        # and what it has to look like.
        min_occurs = 5
        max_occurs = 1000
        min_depth = 7
        max_depth = 7
        pattern = re.compile('^/html/body/.*/(span|div)$')
        extract_content = lambda e: e.css('::text').extract_first()
        #extract_content = lambda e: ' '.join(e.css('*::text').extract())

        # Count how many times each element path occurs in the document.
        doc = bs4.BeautifulSoup(response.body, 'html.parser')
        paths = {}
        self._walk(doc, '', paths)

        # Keep only the paths that satisfy the rules above.
        paths = self._filter(paths, pattern, min_depth, max_depth,
                             min_occurs, max_occurs)

        # Extract content from every element matching one of the kept paths.
        for path in paths.keys():
            for e in response.xpath(path):
                yield {'content': extract_content(e)}

    def _walk(self, doc, parent, paths):
        # Recursively visit every tag and count occurrences of its path.
        for tag in doc.children:
            if isinstance(tag, element.Tag):
                path = parent + '/' + tag.name
                paths[path] = paths.get(path, 0) + 1
                self._walk(tag, path, paths)

    def _filter(self, paths, pattern, min_depth, max_depth, min_occurs, max_occurs):
        # Keep only paths that match the pattern, fall inside the depth range
        # and occur an acceptable number of times.
        return dict((path, count) for path, count in paths.items()
                    if pattern.match(path) and
                    min_depth <= path.count('/') <= max_depth and
                    min_occurs <= count <= max_occurs)
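Assuming you save the spider as example.py (the filename is just an assumption), you can try it out with a plain Scrapy invocation such as:

    scrapy runspider example.py -o items.json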
It works like this:
To build the dictionary of paths, I just walk through the document using BeautifulSoup and count the occurrences of each element path. These counts are later used in the filtering step to keep only the most repeated paths.
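As a rough, standalone sketch (reusing the same walking logic on the first HTML snippet from the question, outside of Scrapy), the paths dictionary ends up mapping element paths to their occurrence counts:

import bs4
from bs4 import element

html = """
<div class="post"><section class="content">Content1</section><section class="panel"></section></div>
<div class="post"><section class="content">Content2</section><section class="panel"></section></div>
"""

def walk(node, parent, paths):
    # Same idea as _walk above: count how often each tag path occurs.
    for tag in node.children:
        if isinstance(tag, element.Tag):
            path = parent + '/' + tag.name
            paths[path] = paths.get(path, 0) + 1
            walk(tag, path, paths)

paths = {}
walk(bs4.BeautifulSoup(html, 'html.parser'), '', paths)
print(paths)  # {'/div': 2, '/div/section': 4}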
Next I filter out the paths based on some basic rules. For a path to be kept, it has to:
- occur at least min_occurs and at most max_occurs times;
- have a depth of at least min_depth and at most max_depth (measured as the number of elements in the path);
- match the pattern.
Other rules can be added in a similar fashion, as in the sketch below.
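For example, with the thresholds from the spider above (the path counts here are made up, purely for illustration), only the path that repeats often enough, has the right depth and matches the pattern survives:

import re

paths = {
    '/html/body/div': 1,                    # too shallow and too rare
    '/html/body/div/div/div/div/span': 12,  # satisfies every rule
    '/html/body/div/div/div/div/a': 12,     # does not match the pattern
}
pattern = re.compile('^/html/body/.*/(span|div)$')
min_depth, max_depth, min_occurs, max_occurs = 7, 7, 5, 1000

kept = dict((path, count) for path, count in paths.items()
            if pattern.match(path) and
            min_depth <= path.count('/') <= max_depth and
            min_occurs <= count <= max_occurs)
print(kept)  # {'/html/body/div/div/div/div/span': 12}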
The last part loops through the paths that are left after filtering and extracts content from the matching elements, using some common logic defined in extract_content.
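If the first text node is not enough, you can swap in a different extract_content lambda; the two variants below are just examples using the standard Scrapy selector API, not part of the original spider:

# Join all descendant text nodes instead of taking only the first one.
extract_content = lambda e: ' '.join(e.css('*::text').extract())

# Or extract an attribute of the matched element, e.g. its href.
extract_content = lambda e: e.xpath('@href').extract_first()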
If your web pages are rather simple and you can infer such rules, it might work. Otherwise, you would have to look at some kind of machine learning task, I guess.