python, web-scraping, topic-modeling, stackexchange-api, stackexchange

Web scraping by tag on Stack Overflow


I would like to scrape this site (stackoverflow.com). Is there an API or some other tool that can be used with Python to get all the posts and comments that have a specific tag?

For example, how do I get all the posts and comments from 10/01/2019 to 01/20/2019 with the python tag?


Solution

  • Have a detailed look at https://api.stackexchange.com/docs/

    You can get all questions from a start date to an end date with a particular tag by making use of the questions method. You need to pass the specific tag into the tagged parameter.

    Here is the URL format for that:
    https://api.stackexchange.com/2.2/questions?fromdate={start_date}&todate={end_date}&order=desc&sort=activity&tagged={tag}&site=stackoverflow

    For example, the link below returns the questions tagged python from 1 July 2019 to 5 July 2019:
    https://api.stackexchange.com/2.2/questions?fromdate=1561939200&todate=1562284800&order=desc&sort=activity&tagged=python&site=stackoverflow

    The dates in the above URL are Unix epoch timestamps in seconds (1561939200 is 1 July 2019, 00:00 UTC). For more information, see the section on dates in the API documentation.
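
As a minimal sketch of the step above, the snippet below converts calendar dates to epoch seconds and assembles the questions URL in the format shown. The helper names `to_epoch` and `questions_url` are made up for illustration and are not part of the API; the dates are assumed to mean midnight UTC.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.2"


def to_epoch(year, month, day):
    """Return the Unix timestamp (seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


def questions_url(tag, start, end, site="stackoverflow"):
    """Build the questions URL for one tag and an epoch-second date range.

    Hypothetical helper: it only formats the URL, it does not call the API.
    """
    params = {
        "fromdate": start,
        "todate": end,
        "order": "desc",
        "sort": "activity",
        "tagged": tag,
        "site": site,
    }
    # urlencode keeps the insertion order of the dict, so the query string
    # matches the URL format given in the answer.
    return f"{API_ROOT}/questions?{urlencode(params)}"


url = questions_url("python", to_epoch(2019, 7, 1), to_epoch(2019, 7, 5))
print(url)
```

The printed URL should be the same as the example link for 1 July 2019 to 5 July 2019.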

    Now that you have the question_id of each question, you can use the questions/{ids}/answers method to get all answers to that question from a start date to an end date.

    Here is the URL format for that:
    https://api.stackexchange.com/2.2/questions/{question_id}/answers?fromdate={start_date}&todate={end_date}&order=desc&sort=activity&site=stackoverflow

    For example, the link below returns the answers from 1 January 2019 to 1 July 2019 to the question with question_id 37181281:
    https://api.stackexchange.com/2.2/questions/37181281/answers?fromdate=1546300800&todate=1561939200&order=desc&sort=activity&site=stackoverflow
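
The answers step could be sketched roughly as follows. This is an illustration under assumptions, not a definitive client: `answers_url`, `http_get_json`, and `fetch_all` are hypothetical helper names, and the `page` and `pagesize` parameters are added to the URL format above because the API returns results one page at a time and reports `has_more` (and sometimes `backoff`) in each response. The `{ids}` part of the path accepts several ids separated by semicolons.

```python
import gzip
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.stackexchange.com/2.2"


def answers_url(question_ids, start, end, page=1, site="stackoverflow"):
    """Build the answers URL for one or more question ids (joined by ';')."""
    ids = ";".join(str(qid) for qid in question_ids)
    params = {
        "fromdate": start,
        "todate": end,
        "order": "desc",
        "sort": "activity",
        "site": site,
        "page": page,
        "pagesize": 100,
    }
    return f"{API_ROOT}/questions/{ids}/answers?{urlencode(params)}"


def http_get_json(url):
    """Fetch a URL and decode its JSON body, decompressing gzip if needed."""
    with urlopen(url) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
    return json.loads(body)


def fetch_all(build_url, get_json=http_get_json):
    """Yield items from every page until the response reports no more pages.

    build_url takes a page number and returns the URL for that page.
    get_json is injectable so the loop can be exercised without the network.
    """
    page = 1
    while True:
        data = get_json(build_url(page))
        yield from data.get("items", [])
        if not data.get("has_more"):
            break
        # Wait if the API asks for it before requesting the next page.
        time.sleep(data.get("backoff", 0))
        page += 1


# Usage (performs real HTTP requests, so it is left commented out):
# answers = list(fetch_all(
#     lambda page: answers_url([37181281], 1546300800, 1561939200, page=page)))
```

Passing the page number through a small builder function keeps the paging loop independent of the endpoint, so the same `fetch_all` could be reused for the questions URL.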

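    Repeating this for every question gives you essentially all the posts (questions and answers) from a start date to an end date with a particular tag. Responses are paginated, so keep requesting the next page while has_more is true.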

    Since you have the question_id and answer_id of each post, you can use the questions/{ids}/comments and answers/{ids}/comments methods to get the comments on these posts.
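
For the comments step, a small sketch of how the ids might be batched into URLs is shown below. The names `chunked` and `comment_urls` are invented for this example, the batch size of 100 reflects the API's limit on ids per request, and `sort=creation` is one possible choice of sort order rather than something the answer prescribes.

```python
API_ROOT = "https://api.stackexchange.com/2.2"


def chunked(ids, size=100):
    """Split ids into consecutive lists of at most `size` elements."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def comment_urls(kind, ids, site="stackoverflow"):
    """Return one comments URL per batch of ids.

    kind must be "questions" or "answers", matching the two methods
    questions/{ids}/comments and answers/{ids}/comments.
    """
    if kind not in ("questions", "answers"):
        raise ValueError("kind must be 'questions' or 'answers'")
    urls = []
    for batch in chunked(ids):
        joined = ";".join(str(post_id) for post_id in batch)
        urls.append(
            f"{API_ROOT}/{kind}/{joined}/comments"
            f"?order=desc&sort=creation&site={site}"
        )
    return urls


for url in comment_urls("questions", [37181281]):
    print(url)
```

Each URL can then be fetched page by page in the same way as the questions and answers requests.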