Tags: python, selenium, web-scraping, beautifulsoup, mechanize

How to approach web-scraping in Python


I am new to Python and have just started with web-scraping. I have to scrape data from this realtor site.

I need to scrape all the details of real-estate agents grouped by their real-estate agency. To do this in the web browser, I have to follow these steps:

  1. Go to this site.
  2. Click the "agency offices" button, enter the postcode 4000 in the search box, and submit.
  3. This gives a list of the agencies.
  4. Go to each agency's "our team" tab, which lists its agents.
  5. Then go to each agent's page and record their information.

Can anyone tell me how to approach this? What's the best way to build this type of scraper?

Do I have to use Selenium to interact with the pages?

I have worked with Requests, BeautifulSoup, and simple form submission using mechanize.


Solution

  • For a search-driven site like this I would recommend either Selenium or Requests with sessions. The advantage of Selenium is that it will probably work, but it will be slow. For Selenium you can use the Selenium IDE (a Firefox add-on) to record what you do, then grab the HTML from the page and use BeautifulSoup to parse the data.
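
    As a rough sketch of that Selenium flow (the URL, input name, and CSS selectors below are placeholders, since the real site's markup isn't shown here), it could look something like this:

      # Rough Selenium sketch: drive the browser, then hand the rendered HTML
      # to BeautifulSoup. The URL, field name and selectors are placeholders.
      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from bs4 import BeautifulSoup

      driver = webdriver.Firefox()  # or webdriver.Chrome()
      try:
          driver.get("https://www.example.com.au/find-agent")      # placeholder URL
          driver.find_element(By.NAME, "query").send_keys("4000")  # hypothetical search box
          driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

          # Parse the rendered result page with BeautifulSoup.
          soup = BeautifulSoup(driver.page_source, "html.parser")
          for link in soup.select("a.agency-link"):                # hypothetical selector
              print(link.get("href"))
      finally:
          driver.quit()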

    If you want to scrape the data quickly and without using many resources, I usually use Requests with sessions. To scrape a website like this, open a modern web browser (Firefox, Chrome) and use its network tools (usually found in the developer tools, or via right-click → Inspect Element). While recording the network traffic, interact with the webpage to see the requests made to the server. For an example search the site may use a suggestions endpoint, e.g.

    https://suggest.example.com.au/smart-suggest?query=4000&n=7&regions=false
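
    A minimal sketch of that first step with a Requests session, assuming the suggest endpoint returns JSON (the host and parameter names are just taken from the example URL above):

      # Rough sketch: reuse one session so cookies persist across calls.
      import requests

      session = requests.Session()
      session.headers.update({"User-Agent": "Mozilla/5.0"})  # some sites reject the default UA

      resp = session.get(
          "https://suggest.example.com.au/smart-suggest",
          params={"query": "4000", "n": 7, "regions": "false"},
      )
      resp.raise_for_status()
      suggestions = resp.json()  # assuming the endpoint returns JSON suggestions
      print(suggestions)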
    

    The response will probably be JSON containing the suggested results. Once you select a suggestion, you can submit a request with those search parameters, e.g.

    https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000
    

    The URLs for the agents will then be in that HTML page; you just need to send a separate request to each agent page and extract their information using BeautifulSoup.
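
    Putting the last two steps together, a rough sketch (again with placeholder CSS selectors that would need to come from the real page's HTML) might look like this:

      # Fetch the results page, collect the agent profile links, then request
      # each profile page. The selectors are placeholders.
      import requests
      from urllib.parse import urljoin
      from bs4 import BeautifulSoup

      session = requests.Session()  # or reuse the session from the previous sketch

      search_url = "https://www.example.com.au/find-agent/agents/petrie-terrace-qld-4000"
      resp = session.get(search_url)
      resp.raise_for_status()
      soup = BeautifulSoup(resp.text, "html.parser")

      agent_urls = [
          urljoin(search_url, a["href"])
          for a in soup.select("a.agent-profile")  # hypothetical selector for agent links
      ]

      for url in agent_urls:
          page = BeautifulSoup(session.get(url).text, "html.parser")
          name = page.select_one("h1")             # hypothetical: agent name heading
          print(url, name.get_text(strip=True) if name else "(name not found)")

    Reusing one requests.Session for the suggest call, the results page, and the individual agent pages keeps cookies and headers consistent, which is usually all a search flow like this needs.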