Search code examples
jsonsnowflake-cloud-data-platformgeojson

Reading JSON files in Snowflake


I have wasted many hours on this, and I still don't understand how to do it properly and how to troubleshoot it when it doesn't work. Sometimes, when I'm lucky, it works. I read questions like this How can I load large JSON files from Snowflake Stage? but they didn't help.

Most of the times I get an "error parsing json" from Snowflake (100069 (22P02): Error parsing JSON: document is too large, max size 16777216 bytes).

I am trying to load a geojson file, but I have similar problems with normal json files. In this case, my file looks like this:

{
"type": "FeatureCollection",
"name": "myshapefileinjson",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
"features": [
{ "type": "Feature", "properties": { "fid": 1, "string_id": "111011" etc..}
{ "type": "Feature", "properties": { "fid": 2, "string_id": "1110114" etc..}
...
]

I am trying to load it into Snowflake, one row for each feature, like this:

copy into raw.myschema.mytable
from (
  select 
    $1
    //"$1:features" gives the same error
  from @my_aws_s3_stage/myfile.geojson.gz
)
file_format = (type = JSON, strip_outer_array = True) 

The geoJSON file is ~1GB uncompressed, but each item in the "features" array is not larger than 30KB. I thought that if I do things right, the 16MB limit applies to each row, so I should be able to load this fine?

Why is it not working? How can I troubleshoot the problem when I get this error? I understand Snowflake is trying to load the entire file in the same row, but I don't understand why. What surprises me is that this is a standard geoJSON file, which Snowflake boasts they support. But it seems to me that they don't..


Solution

  • Answering my own questions:

    1. Although Snowflake boasts they support geoJSON, the documentation doesn't make clear there is a key limitation: your geoJSON file is going to be read all in one row, and so you are limited to geoJSON files individually not larger than 16MB. This is because Snowflake cannot strip away the head dictionary, which is in all standard geoJSON files (this part:
    {
    "type": "FeatureCollection",
    "name": "myshapefileinjson",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
    "features": ...
    }
    
    1. the simplest solution for me was to convert a geoJSON file to a geoJSONSeq, which strips away that head JSON dictionary. A geoJSONSeq is a sequence of JSON features inside an array. That's exactly what Snowflake wants, and it will read any file size, one feature per row, if you set strip_outer_array = True. I did this conversion in Python, with GeoPandas, by running this line: mygeodataframe.to_file(myfile.geojson', driver = 'GeoJSONSeq'), but you can use other tools to save to this format.

    In summary, although Snowflake boasts they support geoJSON, it's practically impossible to find good examples of how to do it, and they should make clear in their documentation that there is a limit of 16MB to the filesize. The solution is to use a geoJSONSeq.