Is there a function within the beautifulsoup package that allows users to set crawling depth within a site? I am relatively new to Python but I have used Rcrawler in R before and Rcrawler provides 'MaxDepth' so the crawler will go within a certain number of links from the homepage within that domain.
Rcrawler(Website = "https://stackoverflow.com/", no_cores = 4, no_conn = 4, ExtractCSSPat = c("div"), MaxDepth = 5)
My current Python script parses all visible text on a page, but I would like to set a crawling depth.
from bs4 import BeautifulSoup
import bs4 as bs
import urllib.request

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    elif isinstance(element, bs.element.Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'lxml')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return u" ".join(t.strip() for t in visible_texts)

html = urllib.request.urlopen('https://stackoverflow.com/').read()
print(text_from_html(html))
Any insight or direction is appreciated.
There is no such function in BeautifulSoup, because BeautifulSoup is not a crawler. It only parses a string of HTML so that you can search within it.
There is no such function in requests either, because requests is not a crawler. It only fetches data from a server, so you can pass the response to BeautifulSoup or a similar parser.
If you use BeautifulSoup and requests, then you have to do everything on your own - you have to build the crawling system from scratch.
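A from-scratch version is not much code, though. Here is a minimal sketch of a breadth-first crawler with a depth limit, similar in spirit to Rcrawler's MaxDepth. The function name `crawl`, the `fetch` callback, and `max_depth` parameter are my own choices, not part of any library; `fetch(url)` is injected so it could be `lambda u: requests.get(u).text` in real use.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl, at most max_depth link-hops from start_url.

    fetch(url) must return the page's HTML as a string. Links pointing
    outside the start domain are skipped, as are already-seen URLs.
    """
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])  # (url, depth) pairs
    pages = {}

    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        pages[url] = html
        if depth >= max_depth:
            continue  # deep enough - do not follow this page's links
        soup = BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])  # resolve relative hrefs
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

You could then feed each value in `pages` to your `text_from_html` function. A production crawler would also need error handling, politeness delays, and robots.txt support, which is exactly why a framework is usually the better choice.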
Scrapy is a real crawler (or rather a framework for building spiders and crawling the web), and it has a DEPTH_LIMIT setting.