python rest web-scraping arcgis

Save data from ArcGIS feature layer


I've been analyzing data that I manually collect daily from a feature layer in an ArcGIS map (linked below). I want to automate this process and have been looking for ways to use a RESTful API (or something else) to collect this information.

The task is to save this table (screenshot below) as a pandas DataFrame in Python that I can operate on.

I tried various GET requests with different combinations of id keys, but I am unfamiliar with APIs and web-scraping.

Is this task feasible? Is it fairly simple to implement? What would be the starting steps for someone intermediate in Python, but unfamiliar with web-scraping?

Thanks!

link: http://erieny.maps.arcgis.com/apps/opsdashboard/index.html#/dd7f1c0c352e4192ab162a1dfadc58e1

screenshot of website with desired information in yellow square


Solution

  • This website is built almost entirely with JavaScript. That said, you can still get the information you want, because the page fetches its data from an API over HTTP. Once you locate that API and reproduce the request it expects, you can pull the data directly.

    To do this, open the Network tab in Chrome DevTools and search for a value you know should be in the data. I tried '14001', since I knew that zip code had to be in the data.

    Network Tools, searching for data

    You can see here that the search has found the request that returns the data. Scrolling down the XHR part of the network tools, you can see the request URL and all the parameters.

    To make this easier on yourself, copy the request as cURL (bash), as seen here. You can paste it into curl.trillworks.com, which converts the request into Python that uses the requests library.

    With the right headers and parameters in hand, getting the data is now quite easy.

    Code Example

    import requests
    import pandas as pd
    
    headers = {
        'Referer': 'http://erieny.maps.arcgis.com/apps/opsdashboard/index.html',
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Mobile Safari/537.36',
    }
    
    params = (
        ('f', 'json'),
        ('where', '1=1'),
        ('returnGeometry', 'false'),
        ('spatialRel', 'esriSpatialRelIntersects'),
        ('outFields', '*'),
        ('orderByFields', 'ZIP_CODE asc'),
        ('resultOffset', '0'),
        ('resultRecordCount', '80'),
        ('resultType', 'standard'),
        ('cacheHint', 'true'),
    )
    
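    # Query the feature layer's REST endpoint with the captured headers and parameters.
    response = requests.get('https://services1.arcgis.com/CgOSc11uky3egK6O/arcgis/rest/services/erie_zip_codes_confirmed_counts/FeatureServer/0/query', headers=headers, params=params)
    response.raise_for_status()  # stop early on an HTTP error status
    
    # Each item in 'features' is a dict whose 'attributes' dict holds one table row.
    data = response.json()['features']
    lists = []
    for a in data:
        zipcode = a['attributes']['ZIP_CODE']
        confirmed = a['attributes']['CONFIRMED']
        lists.append((zipcode, confirmed))
    
    # Build the DataFrame from the list of (zip code, confirmed) tuples.
    df = pd.DataFrame(lists, columns=['Zip Code', 'Confirmed Cases'])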
    

    Output of list

    [('14001', 39),
     ('14004', 70),
     ('14006', 30),
     ('14013', 0),
     ('14025', 11),
     ('14026', 4),
     ('14030', 2),
     ('14031', 84),
     ('14032', 48),
     ('14033', 3),
     ('14034', 1),....]
    

    Output of DataFrame

        Zip Code    Confirmed Cases
    0   14001            39
    1   14004            70
    2   14006            30
    3   14013             0
    4   14025            11
    ...  ...            ...
    61  14225           257
    62  14226           187
    63  14227           260
    64  14228           128
    65  14260             0
    

    Explanation

    We import the requests library, which makes HTTP requests easy to handle.

    The requests.get() method sends a GET request to the URL we give it and returns the response. In this case the response body is JSON. The headers and params arguments let us specify the headers and query parameters to send with the request.
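
    To see what the params argument amounts to, here is a minimal sketch that builds the query string by hand with the standard library. It uses only a subset of the parameters from the code example, and it is an approximation of what requests does for you, not a replacement for it.

```python
from urllib.parse import urlencode

# Endpoint and a subset of the parameters from the code example above.
base_url = (
    "https://services1.arcgis.com/CgOSc11uky3egK6O/arcgis/rest/services/"
    "erie_zip_codes_confirmed_counts/FeatureServer/0/query"
)
params = (
    ("f", "json"),
    ("where", "1=1"),
    ("outFields", "*"),
    ("orderByFields", "ZIP_CODE asc"),
)

# urlencode percent-encodes each value and joins the pairs with "&".
query_string = urlencode(params)
full_url = f"{base_url}?{query_string}"
print(query_string)
```

    Pasting a URL built this way into the browser is a quick way to check which parameters the endpoint actually cares about.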

    So we're using the correct params and headers to make the request. It turns out that it's necessary to send the headers as well as the params. You can test this yourself; indeed, I often start with a simple GET request without any extra data to see how easy the request is to mimic. In this case you need both params and headers.

    The response.json() method converts the JSON response into a Python dictionary.

    It takes a bit of time to find the information you want, so I encourage you to play about with this.

    It turns out the desired information is within response.json()['features'], which is a list of dictionaries, so we have to loop over it. In the loop, a refers to each list item, which is one dictionary. We then follow the specific keys that lead to each value: the 'attributes' key and then the 'ZIP_CODE' key give us the zip code, and the 'attributes' key and then the 'CONFIRMED' key give us the confirmed count. Again, I strongly urge you to play about with the dictionary returned by response.json() to get a feel for this.
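
    One way to play about with the structure is to print a summary of its nesting. The sketch below does that on a small, hypothetical sample shaped like the response described above (the real response has more keys and more features); describe is a helper name made up for this illustration.

```python
import json

# Hypothetical, trimmed-down sample shaped like the response described above;
# the real response carries more keys and more features.
sample_text = """
{
  "features": [
    {"attributes": {"ZIP_CODE": "14001", "CONFIRMED": 39}},
    {"attributes": {"ZIP_CODE": "14004", "CONFIRMED": 70}}
  ]
}
"""

def describe(obj, indent=0):
    """Return lines summarising the nesting of dicts and lists in obj."""
    pad = " " * indent
    lines = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            lines.append(f"{pad}{key}: {type(value).__name__}")
            lines.extend(describe(value, indent + 2))
    elif isinstance(obj, list) and obj:
        # Only look at the first item; the rest are assumed to share its shape.
        lines.append(f"{pad}[0]: {type(obj[0]).__name__} (of {len(obj)})")
        lines.extend(describe(obj[0], indent + 2))
    return lines

payload = json.loads(sample_text)  # same kind of dict that response.json() returns
summary = describe(payload)
print("\n".join(summary))
```

    Running describe(response.json()) on the real response should show the same features / attributes path, along with whatever other keys the service returns.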

    Here I'm appending zipcode and confirmed as a tuple to a list. You can then pass that list to pandas, as shown above.
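
    Since the question is about collecting this table every day, the same list of tuples can also be saved with a date stamp. This is only a sketch using the standard csv module: write_snapshot, the column names, and the date are assumptions for illustration, and the rows are copied from the sample output above.

```python
import csv
import io
from datetime import date

# Tuples in the same shape as the `lists` variable above
# (values copied from the sample output).
rows = [("14001", 39), ("14004", 70), ("14006", 30)]

def write_snapshot(rows, out_file, day):
    """Write one CSV row per zip code, tagged with the collection date."""
    writer = csv.writer(out_file)
    writer.writerow(["Date", "Zip Code", "Confirmed Cases"])
    for zipcode, confirmed in rows:
        writer.writerow([day.isoformat(), zipcode, confirmed])

# An in-memory buffer keeps the sketch self-contained; in practice you would
# pass a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_snapshot(rows, buffer, date(2020, 7, 20))  # hypothetical collection date
csv_text = buffer.getvalue()
print(csv_text)
```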