python r csv screen-scraping

Web scraping from a site that generates a CSV file from a form (http://njdep.rutgers.edu/continuous/data.php)


I am interested in scraping the dataset from http://njdep.rutgers.edu/continuous/data.php in order to create a Shiny app that lets users search through the data on that site.

Once you fill out the form on the site, it generates a .csv file. Is there any way to find out where all of the data, from the earliest date to the most recent, is stored and to extract it using an R or Python package?


Solution

  • In a browser, right-click the page and choose Inspect to open the developer tools. When you click the download button, the request to the underlying REST API shows up in the Network tab. It should look something like this:

    http://njdep.rutgers.edu/continuous/data/downloadData.php?affiliation=NJDEP+-+Marine+Water+Monitoring&project=-1&huc14=-1&county=-1&munis=-1&station_type=-1&station=-1&start_date=&end_date=&params=
    

    If you change the form fields and download again, you can see how each one maps to a query parameter in the URL, and then edit the URL to request different subsets of the data. You can then fetch the data in Python with a package like requests.

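    import requests

    # Paste the URL copied from the Network tab, with the query parameters adjusted
    url = 'your_modified_url'
    res = requests.get(url, timeout=60)
    res.raise_for_status()  # raise an exception on a 4xx/5xx response
    data = res.content  # raw bytes of the CSV file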
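
As a rough sketch of the URL-editing step described above, the query string can be built from a dictionary rather than edited by hand. The parameter names and default values are copied from the example URL. The helper names `build_url` and `parse_csv` are hypothetical, and treating `-1` or an empty value as "no filter" is an assumption about the server, not documented behaviour.

```python
import csv
import io
from urllib.parse import urlencode

BASE_URL = "http://njdep.rutgers.edu/continuous/data/downloadData.php"

# Defaults copied from the example URL. Assumption: -1 and empty strings
# mean "do not filter on this field".
DEFAULT_PARAMS = {
    "affiliation": "NJDEP - Marine Water Monitoring",
    "project": "-1",
    "huc14": "-1",
    "county": "-1",
    "munis": "-1",
    "station_type": "-1",
    "station": "-1",
    "start_date": "",
    "end_date": "",
    "params": "",
}


def build_url(**overrides):
    """Return the download URL with any query parameters overridden."""
    params = {**DEFAULT_PARAMS, **overrides}
    # urlencode escapes spaces as '+', matching the URL seen in the browser.
    return BASE_URL + "?" + urlencode(params)


def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))
```

The result of `build_url()` can be passed to `requests.get` as in the snippet above, and `res.text` can then be handed to `parse_csv`. To narrow the request, override individual fields, for example `build_url(county="...")`, using the values the form actually sends; check the Network tab for the exact value and date formats, since they are not known here.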