python web-scraping beautifulsoup scrapy urllib2

Python Web Scraping Solution Recommendations Required


I would like to know the best/preferred Python 3.x solution (fast to execute, easy to implement, with the option to specify a user agent and send browser and version details to the web server, so that my IP is not blacklisted) that can scrape data in all of the scenarios below (listed in order of complexity, as I understand it).

  1. Any static web page with data in tables / divs.
  2. A dynamic web page that finishes loading in one go.
  3. A dynamic web page that requires sign-in with a username and password and finishes loading in one go after login. Sample URL for username/password login: https://dashboard.janrain.com/signin?dest=http://janrain.com
  4. A dynamic web page that requires sign-in using OAuth from a popular service such as LinkedIn or Google and finishes loading in one go after login. I understand this involves some page redirects, token handling, etc. Sample URL for OAuth-based logins: https://dashboard.janrain.com/signin?dest=http://janrain.com
  5. All of bullet point 4 above, combined with selecting a drop-down value (say, "sort by date") or ticking some check-boxes, which changes the dynamic data displayed. I need to scrape the data after the check-box/drop-down action has been performed, just as a user would do it. Sample URL: https://careers.microsoft.com/us/en/search-results?rk=l-seattlearea (this page has a drop-down as well as some check-boxes).
  6. A dynamic web page with Ajax loading, in which data keeps loading as:
     6.1 we keep scrolling down, as on the Facebook, Twitter or LinkedIn main page. Sample URLs: Facebook, Twitter, LinkedIn, etc.
     6.2 we keep clicking a button/div at the end of the Ajax container to get the next set of data. Sample URL: https://www.linkedin.com/pulse/cost-climate-change-indian-railways-punctuality-more-editors-india-/ Here you have to click "Show Previous Comments" at the bottom of the page to see and scrape all the comments.

I want to learn and build one exhaustive scraping solution that can be tweaked to handle every option, from the easy task in bullet point 1 to the complex task in bullet point 6, as and when required.


Solution

    1. I would recommend BeautifulSoup for your problems 1 and 2.
    2. For 3 and 5 you can use Selenium WebDriver (available as a Python library). With Selenium you can perform all the operations you need (e.g. logging in, changing drop-down values, navigating) and then access the page content through driver.page_source (you may need to use a sleep function to wait until the content is fully loaded).
    3. For 6 you can use the sites' own APIs to get the list of news feed items and their links (the returned object usually comes with a link to the particular item); once you have the links, you can use BeautifulSoup to get the web content.
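
As a rough illustration of point 1, here is a minimal sketch that sends a custom User-Agent header and pulls the rows out of an HTML table with BeautifulSoup. The user agent string, the helper names fetch_html and table_rows, and the sample HTML are all made up for this example; fetching is done with the standard library's urllib.request, though any HTTP client would do.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

# Hypothetical user agent string; substitute whatever identifies your client.
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper/0.1)"


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download a page, sending an explicit User-Agent header."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def table_rows(html):
    """Return every table row in the HTML as a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Inline sample so the parsing step can be tried without any network access.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
</table>
"""

rows = table_rows(SAMPLE_HTML)
print(rows)
```

For a live page you would call table_rows(fetch_html(url)). Note that this only sees the HTML the server returns; content that JavaScript adds afterwards will not be there.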

    Note: Please read each website's terms and conditions before scraping, because some of them describe automated data collection as unethical behavior, which we should not engage in as professionals.