Tags: python, web-crawler, urllib, robots.txt

Web crawling domain issue


I'm using a small script to crawl the links on a domain and generate a sitemap from them.

Right now it's working, and it's fairly simple.

But I need to crawl one specific domain, and for some reason this domain doesn't let me crawl anything, even though it does have links on it, and a sitemap.xml file as well.

I guess there must be a robots.txt rule or some other server-side trick behind this. Assuming that scenario, what could be a workaround for crawling it?

I thought about reading the sitemap.xml file and writing it somewhere, but that seems like a bit of a strange idea.

This is the domain.

And this is the code, which works fine for now on other domains:

import urllib.request as urllib2  # alias kept for familiarity with the old urllib2 API
from bs4 import BeautifulSoup

myurl = "https://www.google.com/"
url = urllib2.urlopen(myurl)  # fetch the page; returns a file-like response object

soup = BeautifulSoup(url, 'html.parser')

# Collect every anchor tag and print its href attribute
all_links = soup.find_all('a')

for link in all_links:
    print(link.get('href'))

Any idea/workaround for this?

Thanks a lot


Solution

  • The reason you aren't able to get anything with your script is that the site is written in React, meaning the links are populated by JavaScript at runtime. To crawl such sites you need a tool that can execute the embedded JavaScript code. You could use something like Selenium or requests-html (from the creator of the famous requests package); see the sketch below.
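As a minimal sketch of the Selenium approach (the URL below is a placeholder, and this assumes Chrome with a compatible chromedriver is available on your machine):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Run Chrome headless so no browser window opens
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

# Wait up to 10 seconds for elements to appear while React renders
driver.implicitly_wait(10)

# Placeholder URL; replace with the domain you are trying to crawl
driver.get("https://www.example.com/")

# The browser executes the page's JavaScript, so the React-rendered
# links are present in the DOM by the time we query them
for anchor in driver.find_elements(By.TAG_NAME, "a"):
    print(anchor.get_attribute("href"))

driver.quit()

With requests-html the idea is the same: after session.get(url), calling r.html.render() executes the page's JavaScript (it downloads Chromium on first use), and r.html.absolute_links then contains the rendered links.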