Search code examples
pythonurllib3socratasoda

SODA API not returning all fields in data set


I am having trouble obtaining all of the fields in a dataset from the Socrata Open Data Network. I am using urllib3 to make the request as such:

url = 'https://data.sfgov.org/resource/g8m3-pdis.json'
http = urllib3.PoolManager(ca_certs=certifi.where())
req = http.request_encode_url('GET', url)


data = json.loads(req.data.decode('utf-8'))

This request returns all but the last 6 fields in the dataset. I have tried using the

$select=*,*

statement to specify which fields I would like to retrieve but that still leaves out the last 6 fields even if I specify them.

However, when I use the

$limit

statement and set it to something greater than 1000 I get some of those fields that I am missing. But, still not all of them.

One thing that does work is specifying which row to grab, so if I use the following:

https://data.sfgov.org/resource/g8m3-pdis.json?ttxid=0000028-02-001

It will return all of the fields that I am looking for.

I guess I could make a request for all of the ttxid fields and then send out a separate request for each one in order to grab all the fields but there has to be a better way to do this. Does anyone have any ideas?


Solution

  • It appears there are a number of fields that are frequently blank: location, business_corridor, lic_code_description, etc. When using JSON documents, Socrata omits blank fields from the response (which is different from the CSV endpoint).

    It looks like the default call, https://data.sfgov.org/resource/g8m3-pdis.json, just happens to return fields that have blanks so does not appear. However, when $limit is added above the default of 1,000, the API happens to returns some rows with complete data so the field appears in limited circumstances.

    This can also be confirmed by grabbing a row which does not have all fields. The OP showed one example that returned all fields: https://data.sfgov.org/resource/g8m3-pdis.json?ttxid=0000028-02-001

    But this example has blank fields so is omitted in the response: https://data.sfgov.org/resource/g8m3-pdis.json?ttxid=1079490-06-161

    One way around this is to grab data from the CSV endpoint: https://data.sfgov.org/resource/g8m3-pdis.csv