python, web-scraping, python-requests, urllib

Web Scraper not getting the full data from a website


I'm trying to scrape this website to build a database of blood donation camps using Python.

First, fetching the page's HTML with requests or urllib raises an SSL certificate verification error. I bypassed it by passing verify=False to requests.get(), or by creating an unverified context for urllib (a quick fix). This gets me past the error, but the table content I need is missing from the retrieved HTML. In the website's source the rows sit inside the tbody tags, but my request returns only the tags themselves, with nothing between them. I'm very new to scraping, so a little guidance would be appreciated. Thanks.

from urllib.request import urlopen as uReq
import ssl
from bs4 import BeautifulSoup as soup

my_url = 'https://www.eraktkosh.in/BLDAHIMS/bloodbank/campSchedule.cnt'

# Quick fix for the SSL error: skip certificate verification
sp_context = ssl._create_unverified_context()
uClient = uReq(my_url, context=sp_context)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
table = page_soup.find('tbody')
print(table)  # this outputs "<tbody></tbody>"
trow = table.find('tr')
print(trow)  # this outputs "None"


First print command gives

<tbody>
</tbody>

and second outputs

None 

Solution

  • This happens because the first request returns an almost empty HTML scaffold.

    The data you see on the page is populated by subsequent AJAX requests, this one to be exact: https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt?hmode=GETNEARBYCAMPS&stateCode=-1&districtCode=-1&_=1560150852947

    You can find this request by right-clicking the page, choosing Inspect, opening the Network tab, and reloading the page.

    Opinion: BeautifulSoup is not required to extract information from this page. The data is readily available in JSON format from the API mentioned above.

    Hope this helps.
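As a rough sketch of the approach described in this answer, the AJAX endpoint could be called directly with the standard library. The helper names build_camps_url and fetch_camps are made up for illustration. The query parameters are copied from the URL quoted above. Treating the trailing "_" parameter as a millisecond timestamp used for cache-busting is an assumption based on its value, and the structure of the returned JSON is not documented here, so inspect it before relying on any field.

```python
import json
import ssl
import time
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the AJAX request quoted in the answer.
API_URL = "https://www.eraktkosh.in/BLDAHIMS/bloodbank/nearbyBB.cnt"


def build_camps_url(state_code=-1, district_code=-1, timestamp_ms=None):
    """Build the camp-list URL (hypothetical helper, not part of any library).

    Assumption: "_" is a millisecond timestamp acting as a cache-buster,
    so the current time is used when no value is supplied.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    query = urlencode({
        "hmode": "GETNEARBYCAMPS",
        "stateCode": state_code,
        "districtCode": district_code,
        "_": timestamp_ms,
    })
    return f"{API_URL}?{query}"


def fetch_camps(url):
    """Download the URL and decode the response body as JSON."""
    # Same quick fix as in the question: certificate verification is skipped.
    context = ssl._create_unverified_context()
    with urlopen(url, context=context) as response:
        return json.loads(response.read())
```

Calling fetch_camps(build_camps_url()) should then give the decoded payload as ordinary Python lists and dictionaries, provided the endpoint still responds the way it did when the answer was written. Printing the result once is the easiest way to see which fields hold the camp details.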