Tags: python, python-3.x, web-scraping, beautifulsoup, rcrawler

Crawling Depth with BeautifulSoup


Is there a function within the beautifulsoup package that allows users to set the crawling depth within a site? I am relatively new to Python, but I have used Rcrawler in R before, and Rcrawler provides 'MaxDepth', so the crawler stays within a certain number of links from the homepage of that domain.

Rcrawler(Website = "https://stackoverflow.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c("div"), MaxDepth = 5)

My current Python script parses all visible text on a page, but I would like to set a crawling depth.

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.request

def tag_visible(element):
    # skip text inside tags that never render on the page
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    elif isinstance(element, Comment):
        return False
    return True


def text_from_html(body):
    soup = BeautifulSoup(body, 'lxml')  # parse the argument, not the global
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('https://stackoverflow.com/').read()
print(text_from_html(html))

Any insight or direction is appreciated.


Solution

There is no such function in BeautifulSoup, because BeautifulSoup is not a crawler. It only parses a string of HTML so that you can search within it.

There is no such function in requests either, because requests is not a crawler. It only fetches data from a server, which you can then parse with BeautifulSoup or a similar library.

If you use requests and BeautifulSoup, you have to do everything yourself and build the crawling system from scratch, as in the sketch below.
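For example, here is a minimal sketch of such a from-scratch, depth-limited crawler. It assumes the requests library is installed and that, like Rcrawler by default, you only want to follow links within the starting domain; the names crawl and max_depth are invented here, not part of any library:

import urllib.parse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=5):
    # Breadth-first crawl that stops max_depth links away from start_url.
    domain = urllib.parse.urlparse(start_url).netloc
    seen = set()
    queue = [(start_url, 0)]                 # (url, depth) pairs still to visit
    while queue:
        url, depth = queue.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                         # skip pages that fail to load
        soup = BeautifulSoup(html, 'lxml')
        yield url, soup                      # hand each parsed page to the caller
        for a in soup.find_all('a', href=True):
            link = urllib.parse.urljoin(url, a['href'])
            if urllib.parse.urlparse(link).netloc == domain:
                queue.append((link, depth + 1))

for url, soup in crawl('https://stackoverflow.com/', max_depth=2):
    print(url)

Each yielded soup can then be fed through the same visible-text logic as text_from_html in the question.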

Scrapy is a real crawler (or rather, a framework for building spiders and crawling the web), and it has the setting DEPTH_LIMIT, which does what Rcrawler's MaxDepth does.
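For example, here is a minimal spider sketch (the spider name and the yielded fields are invented here). Save it as stack_spider.py and run it with scrapy runspider stack_spider.py:

import scrapy

class StackSpider(scrapy.Spider):
    name = 'stack'
    start_urls = ['https://stackoverflow.com/']
    custom_settings = {'DEPTH_LIMIT': 5}     # stop 5 links away from start_urls

    def parse(self, response):
        # extract whatever you need from each page
        yield {'url': response.url, 'title': response.css('title::text').get()}
        # follow every link; Scrapy enforces DEPTH_LIMIT automatically
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)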