databricks, geojson, azure-databricks, geopandas

Reading GeoJSON in Databricks, no mount point set


We have recently made changes to how we connect to ADLS from Databricks, which have removed the mount points that were previously established within the environment. We are using Databricks to find points in polygons, as laid out in the Databricks blog here: https://databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html

Previously, a chunk of code read a GeoJSON file from ADLS into the notebook and then broadcast it to the cluster(s):

nights = gpd.read_file("/dbfs/mnt/X/X/GeoSpatial/Hex_Nights_400Buffer.geojson")
a_nights = sc.broadcast(nights) 

However, the new changes have removed the mount point, and we are now reading files in using strings of the form:

"wasbs://[email protected]/X/Personnel/*.csv"

This works fine for CSV and Parquet files, but will not load a GeoJSON! When we try this, we get an error saying "File not found". We have checked and the file is still within ADLS.
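
For reference, reading the CSVs with Spark via the wasbs path looks roughly like the sketch below (the container and account names are placeholders for our anonymised paths); passing the same kind of string to gpd.read_file is what produces the "File not found" error:

# Spark readers accept the wasbs:// URI directly (placeholder names)
df = spark.read.csv("wasbs://<container>@<account>.blob.core.windows.net/X/Personnel/*.csv")

# The same style of path handed to geopandas fails with "File not found"
# nights = gpd.read_file("wasbs://<container>@<account>.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson")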

We then tried to copy the file temporarily to DBFS, which was the only way we had managed to read these files previously, as follows:

dbutils.fs.cp("wasbs://[email protected]/X/GeoSpatial/Nights_new.geojson", "/dbfs/tmp/temp_nights")
nights = gpd.read_file(filename="/dbfs/tmp/temp_nights")
dbutils.fs.rm("/dbfs/tmp/temp_nights")
a_nights = sc.broadcast(nights) 

This works fine on the first use within the code, but a second GeoJSON run immediately afterwards (which we tried to write to temp_days) fails at the gpd.read_file stage, saying file not found! We have checked with dbutils.fs.ls() and can see the file in the temp location.

So some questions for you kind folks:

  1. Why were we previously having to use "/dbfs/" when reading in GeoJSON but not CSV files, before the changes to our environment?
  2. What is the correct way to read GeoJSON files into Databricks without a mount point set?
  3. Why does our process fail when trying to read the second temporary GeoJSON file?

Thanks in advance for any assistance - very new to Databricks...!


Solution

  • Pandas (and GeoPandas) use the local file API for accessing files, and you previously accessed files on DBFS via /dbfs, which exposes DBFS through that local file API. In your specific case, the problem is that even though you use dbutils.fs.cp, you didn't specify that you want the file copied locally, so by default it was copied onto DBFS with the path /dbfs/tmp/temp_nights (actually dbfs:/dbfs/tmp/temp_nights), and as a result the local file API doesn't see it - you would need to use /dbfs/dbfs/tmp/temp_nights instead, or copy the file into /tmp/temp_nights (an end-to-end sketch combining both options follows at the end of this answer).

    But the better way would be to copy the file locally - you just need to specify that the destination is local, which is done with the file:// prefix, like this:

    dbutils.fs.cp("wasbs://[email protected]/...Nights_new.geojson", 
       "file:///tmp/temp_nights")
    

    and then read the file from /tmp/temp_nights:

    nights = gpd.read_file(filename="/tmp/temp_nights")
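
    Putting the pieces together with the broadcast step from your question, a minimal end-to-end sketch could look like this (the wasbs path and temp file names are placeholders; the commented block shows the alternative of keeping the copy on DBFS and reading it back through /dbfs):

    import geopandas as gpd

    src = "wasbs://<container>@<account>.blob.core.windows.net/X/GeoSpatial/Nights_new.geojson"  # placeholder path

    # Option 1: copy to the driver's local disk and read it via the local file API
    dbutils.fs.cp(src, "file:///tmp/temp_nights.geojson")
    nights = gpd.read_file("/tmp/temp_nights.geojson")

    # Option 2: copy onto DBFS and read it back through the /dbfs local-file view
    # dbutils.fs.cp(src, "dbfs:/tmp/temp_nights.geojson")
    # nights = gpd.read_file("/dbfs/tmp/temp_nights.geojson")

    a_nights = sc.broadcast(nights)

    # remove the temporary copy once the GeoDataFrame has been broadcast
    dbutils.fs.rm("file:///tmp/temp_nights.geojson")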