python, python-3.x, selenium, web-scraping, arcgis

Unable to scrape different owner names from different box-like containers out of a map


I'm trying to click on a map using selenium so that I can scrape the parcel ID and owner name from box-like containers. When a click is made on that map, a box-like container shows up, and I would like to scrape the parcel ID and owner name from it. I tried using requests but could not find any way to locate the information shown in such containers, so I'm now trying selenium. The script below neither clicks on the map nor throws any error.


from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

link = "http://app01.cityofboston.gov/parcelviewer/"

driver = webdriver.Chrome()
driver.get(link)
wait = WebDriverWait(driver, 20)
for item in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "svg#mapDiv_gc"))):
    item.click()
driver.quit()

How can I grab the parcel IDs and the owner names from the different box-like containers on that map?


Solution

  • This data comes from an ArcGIS REST service.

    I've located this ArcGIS query call that returns the wanted data:

    GET https://services.arcgis.com/sFnw0xNflSi8J0uh/arcgis/rest/services/Parcels19WMFull/FeatureServer/0/query
    

    I've checked what generates this URL and found the following:

    This query is issued when you search for data in the input in the top left corner. You can edit the URL parameters to match all data:

    {
        "f": "json",
        "where": "1=1",
        "returnGeometry": "true",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "outSR": "102100"
    }
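
To make the parameter editing concrete, here is a minimal sketch that turns such a dictionary into the full GET URL using only the standard library. The helper name build_query_url and its arguments are hypothetical; the endpoint and parameter names are the ones quoted above.

```python
from urllib.parse import urlencode

# Endpoint quoted in the answer above.
QUERY_URL = (
    "https://services.arcgis.com/sFnw0xNflSi8J0uh/arcgis/rest/services/"
    "Parcels19WMFull/FeatureServer/0/query"
)


def build_query_url(where="1=1", out_fields="*", return_geometry=True):
    """Return the full GET URL for one query (hypothetical helper)."""
    params = {
        "f": "json",
        "where": where,
        "returnGeometry": "true" if return_geometry else "false",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": out_fields,
        "outSR": "102100",
    }
    # urlencode percent-encodes values, e.g. "1=1" becomes "1%3D1".
    return f"{QUERY_URL}?{urlencode(params)}"


print(build_query_url(out_fields="FID,FULL_ADDRE,PID", return_geometry=False))
```

Limiting outFields and switching off geometry keeps responses small while you are only exploring the data.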
    

    It returns a maximum of 2000 items per request, so we'll need to iterate. To know how to iterate, we can look at the content of the features array. A query limited to a few fields gives something like this:

    {
      "attributes": {
        "FID": 1,
        "FULL_ADDRE": "104 A 104 PUTNAM ST, 02128",
        "PID": "0100001000"
      }
    },
    {
      "attributes": {
        "FID": 2,
        "FULL_ADDRE": "18 LEVERETT AV #10-B, 02128",
        "PID": "0101399120"
      }
    },
    {
      "attributes": {
        "FID": 3,
        "FULL_ADDRE": "197 LEXINGTON ST, 02128",
        "PID": "0100002000"
      }
    }
    ....
    

    So we can page on the FID field: the first request uses where=1=1, the second uses where=FID > 2000, and for each following request we store the last FID received and set the where clause to FID > {last_fid}.
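
The paging idea can be sketched independently of the network. In the sketch below, fetch_page is a hypothetical callable standing in for one HTTP request, and make_fake_service simulates a layer that caps each response. Passing an order-by field is an assumption added here (ArcGIS queries accept an orderByFields parameter); the approach described above relies on the service returning rows in FID order.

```python
def fetch_all(fetch_page, page_size=2000):
    """Collect every feature by paging on FID.

    fetch_page(where, order_by) is a hypothetical callable that performs one
    query and returns the list found under the "features" key.
    """
    features = []
    where = "1=1"
    while True:
        page = fetch_page(where, "FID")
        features.extend(page)
        # A short page means the service has nothing left to return.
        if len(page) < page_size:
            break
        last_fid = page[-1]["attributes"]["FID"]
        where = f"FID > {last_fid}"
    return features


def make_fake_service(total, page_size):
    """Simulate a layer with FIDs 1..total that caps each response."""
    rows = [{"attributes": {"FID": fid}} for fid in range(1, total + 1)]

    def fetch_page(where, order_by):
        # Rows are already sorted by FID, so order_by is accepted but unused.
        start = 0 if where == "1=1" else int(where.split(">")[1])
        return rows[start:start + page_size]

    return fetch_page


all_rows = fetch_all(make_fake_service(total=4500, page_size=2000), page_size=2000)
print(len(all_rows))
```

Stopping on a short page also covers the case where the total is an exact multiple of the page size: the final request simply returns an empty list.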

    So you can build a script like this :

    import requests
    
    base_url = "http://app01.cityofboston.gov/parcelviewer"
    
    # get map id
    r = requests.get(f"{base_url}/config/ParcelViewer.json")
    map_id = r.json()["values"]["webmap"]
    
    # get the query url
    r = requests.get(f"https://www.arcgis.com/sharing/rest/content/items/{map_id}/data", params = {
        "f": "json"
    })
    url = r.json()["operationalLayers"][0]["url"]
    
    params = {
        "f": "json",
        "where": "1=1",
        "returnGeometry": "true",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "*",
        "outSR": "102100"
    }
    
    data = []
    count = 1
    finish = False
    
    # Keep requesting until a page comes back with fewer than 2000 features.
    while not finish:
        print(f"[{count}] requesting...")
        r = requests.get(f"{url}/query", params=params)
        entries = r.json()["features"]
        if len(entries) < 2000:
            finish = True
        else:
            # Resume after the last FID of this page on the next request.
            last_fid = entries[-1]["attributes"]["FID"]
            print(f"next fid : {last_fid}")
            params["where"] = f"FID > {last_fid}"
        data.extend(entries)
        print(f"[{count}] received {len(entries)} items - total received : {len(data)}")
        count += 1
    
    print(f"TOTAL: {len(data)}")
    
    # print the last element (just to check)
    print(data[-1])
    

    After several minutes, the script has extracted 171922 records :



    This is what an entry looks like :

    {
        'attributes': {
            'FID': 171922,
            'PID_LONG': '2205670000',
            'PID': '2205670000',
            'GIS_ID': '2205670000',
            'FULL_ADDRE': '2203 COMMONWEALTH AV, 02135',
            'OWNER': 'COMMWLTH OF MASS',
            'LAND_USE': 'E',
            'LAND_SF': 34125,
            'LIVING_ARE': 7386,
            'AV_LAND': 1325400,
            'AV_BLDG': 841100,
            'AV_TOTAL': 2166500,
            'GROSS_TAX': 0,
            'ID': 0,
            'SHAPE_Leng': 1003.12908156,
            'SHAPE_Area': 33512.6220608,
            'Shape__Area': 5702.6640625,
            'Shape__Length': 414.046143349521
        },
        'geometry': {
            'rings': [
                [
                    [-7922244.91043368, 5212145.61745703],
                    [-7922247.98527419, 5212105.5446644],
                    [-7922243.75007186, 5212106.29247827],
                    [-7922235.83595224, 5212062.80771992],
                    [-7922239.05526106, 5212062.68000813],
                    [-7922327.54387782, 5212214.66112252],
                    [-7922281.74795739, 5212208.62518937],
                    [-7922266.82960043, 5212207.97287607],
                    [-7922241.02937963, 5212204.61661323],
                    [-7922244.0269726, 5212158.45234151],
                    [-7922244.91043368, 5212145.61745703]
                ]
            ]
        }
    }
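
Since the question asks specifically for parcel IDs and owner names, here is a minimal sketch of pulling those fields out of such entries and serialising them as CSV. The helper names owner_rows and to_csv are hypothetical; the field names PID, OWNER and FULL_ADDRE come from the entry shown above.

```python
import csv
import io


def owner_rows(features):
    """Yield (PID, OWNER, FULL_ADDRE) tuples from query features."""
    for feature in features:
        attrs = feature.get("attributes", {})
        yield attrs.get("PID"), attrs.get("OWNER"), attrs.get("FULL_ADDRE")


def to_csv(features):
    """Return the selected fields as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["PID", "OWNER", "FULL_ADDRE"])
    writer.writerows(owner_rows(features))
    return buffer.getvalue()


# Attributes copied from the sample entry above.
sample = [
    {
        "attributes": {
            "PID": "2205670000",
            "OWNER": "COMMWLTH OF MASS",
            "FULL_ADDRE": "2203 COMMONWEALTH AV, 02135",
        }
    }
]
print(to_csv(sample))
```

With the full download, to_csv(data) would produce one line per parcel; the csv module quotes addresses that contain commas.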
    

    One last thing: to check the result count directly on the API, we can use a parameter exposed by the ArcGIS query UI for this layer (which is the layer used by the map on the website, by the way). When filtering by count only, the UI adds returnCountOnly=true, so let's add that to our query endpoint:

    https://services.arcgis.com/sFnw0xNflSi8J0uh/arcgis/rest/services/Parcels19WMFull/FeatureServer/0/query?f=json&where=1%3D1&returnGeometry=false&spatialRel=esriSpatialRelIntersects&outFields=FID%2CFULL_ADDRE%2CPID&outSR=102100&returnCountOnly=true

    which correctly returns:

    {"count":171922}
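
The same check can be scripted. This is a sketch only: count_params and check_download are hypothetical helpers, and the comparison uses unique FID values so that a page fetched twice would be noticed.

```python
def count_params(where="1=1"):
    """Query parameters asking the service for a count only."""
    return {"f": "json", "where": where, "returnCountOnly": "true"}


def check_download(count_response, features):
    """True when the number of unique FIDs matches the server-side count."""
    expected = count_response.get("count")
    unique_fids = {feature["attributes"]["FID"] for feature in features}
    return expected == len(unique_fids)


# Tiny illustration with made-up features.
features = [{"attributes": {"FID": fid}} for fid in (1, 2, 3)]
print(check_download({"count": 3}, features))
```

In the script above, the count response would come from something like requests.get(f"{url}/query", params=count_params()).json(), compared against the data list.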
    

    Note that you can apply a variant of this script to any ArcGIS REST service query. I've made an example in this gist to get the data from the map (cities). Note that the maximum number of results returned by the API may differ depending on the service.
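
Because the per-request limit can differ between services, one option is to read it from the service instead of hard-coding 2000. The sketch below assumes, based on my understanding of the ArcGIS REST API and not on anything confirmed above, that the layer metadata (requested with f=json on the layer URL) exposes a maxRecordCount field and that query responses may carry an exceededTransferLimit flag. Both helper names are hypothetical.

```python
def page_size_from_metadata(layer_info, default=2000):
    """Return the advertised record limit, or a default if it is missing."""
    value = layer_info.get("maxRecordCount")
    if isinstance(value, int) and value > 0:
        return value
    return default


def has_more(query_response, page_size):
    """Decide whether another page should be requested."""
    # Assumption: the flag, when present, signals that the response was cut off.
    if "exceededTransferLimit" in query_response:
        return bool(query_response["exceededTransferLimit"])
    # Fall back to the page-length test used in the script above.
    return len(query_response.get("features", [])) >= page_size


print(page_size_from_metadata({"maxRecordCount": 1000}))
print(has_more({"features": [], "exceededTransferLimit": False}, 1000))
```

If the metadata lacks the field, the fallback keeps the behaviour of the original script.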