I have a shapefile on my HDFS and I would like to import it into my Jupyter Notebook with geopandas (version 0.8.1).
I tried the standard read_file() method, but it does not recognize the HDFS path; I believe it looks only in the local file system, because a test with a local directory read the shapefile correctly.
This is the code I used:
import geopandas as gpd
shp = gpd.read_file('hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp')
and the error I obtained:
---------------------------------------------------------------------------
CPLE_OpenFailedError Traceback (most recent call last)
fiona/_shim.pyx in fiona._shim.gdal_open_vector()
fiona/_err.pyx in fiona._err.exc_wrap_pointer()
CPLE_OpenFailedError: hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp: No such file or directory
During handling of the above exception, another exception occurred:
DriverError Traceback (most recent call last)
<ipython-input-17-3118e740e4a9> in <module>
----> 2 shp = gpd.read_file('hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp')
3 print(shp.shape)
4 shp.head(3)
/opt/venv/geocoding/lib/python3.6/site-packages/geopandas/io/file.py in _read_file(filename, bbox, mask, rows, **kwargs)
94
95 with fiona_env():
---> 96 with reader(path_or_bytes, **kwargs) as features:
97
98 # In a future Fiona release the crs attribute of features will
/opt/venv/geocoding/lib/python3.6/site-packages/fiona/env.py in wrapper(*args, **kwargs)
398 def wrapper(*args, **kwargs):
399 if local._env:
--> 400 return f(*args, **kwargs)
401 else:
402 if isinstance(args[0], str):
/opt/venv/geocoding/lib/python3.6/site-packages/fiona/__init__.py in open(fp, mode, driver, schema, crs, encoding, layer, vfs, enabled_drivers, crs_wkt, **kwargs)
255 if mode in ('a', 'r'):
256 c = Collection(path, mode, driver=driver, encoding=encoding,
--> 257 layer=layer, enabled_drivers=enabled_drivers, **kwargs)
258 elif mode == 'w':
259 if schema:
/opt/venv/geocoding/lib/python3.6/site-packages/fiona/collection.py in __init__(self, path, mode, driver, schema, crs, encoding, layer, vsi, archive, enabled_drivers, crs_wkt, ignore_fields, ignore_geometry, **kwargs)
160 if self.mode == 'r':
161 self.session = Session()
--> 162 self.session.start(self, **kwargs)
163 elif self.mode in ('a', 'w'):
164 self.session = WritingSession()
fiona/ogrext.pyx in fiona.ogrext.Session.start()
fiona/_shim.pyx in fiona._shim.gdal_open_vector()
DriverError: hdfs://hdfsha/my_hdfs_directory/my_shapefile.shp: No such file or directory
So, I was wondering whether it is actually possible to read a shapefile, stored in HDFS, with geopandas. If yes, how?
If someone is still looking for an answer to this question, I managed to find a workaround.
First of all, you need a .zip file that contains all the files belonging to your shapefile (.shp, .shx, .dbf, ...). Then use pyarrow to establish a connection to HDFS and fiona to read the zipped shapefile.
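Before uploading the archive to HDFS, it can help to check that it really contains the mandatory parts of each shapefile. The following is only a minimal sketch using the standard library; the function name `missing_shapefile_parts` and the choice of .shp, .shx and .dbf as the required set are my own assumptions, not something geopandas or fiona provides.

```python
import io
import zipfile

# Sidecar files assumed to be mandatory for a readable shapefile.
REQUIRED_EXTENSIONS = (".shp", ".shx", ".dbf")


def missing_shapefile_parts(zip_bytes):
    """Map each .shp member of the archive to the required extensions it lacks."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        names = archive.namelist()
    # Compare case-insensitively, since extensions are sometimes upper case.
    lowered = {name.lower() for name in names}
    report = {}
    for name in names:
        if not name.lower().endswith(".shp"):
            continue
        stem = name[:-4].lower()
        report[name] = [ext for ext in REQUIRED_EXTENSIONS
                        if stem + ext not in lowered]
    return report
```

An empty list for a given .shp entry means all three parts were found next to it; the keys of the returned dictionary are also the names you can pass to `z.open(...)` later.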
Package versions I'm using:
pyarrow==2.0.0
fiona==1.8.18
The code:
# import packages
import pandas as pd
import geopandas as gpd
import fiona
import pyarrow

# establish a connection to HDFS
fs = pyarrow.hdfs.connect()

# read zipped shapefile
with fiona.io.ZipMemoryFile(fs.open('hdfs://my_hdfs_directory/my_zipped_shapefile.zip')) as z:
    with z.open('my_shp_file_within_zip.shp') as collection:
        gdf = gpd.GeoDataFrame.from_features(collection)
        print(gdf.shape)
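If you would rather not keep the whole archive in memory, another option is to copy it from HDFS to a local temporary file and let geopandas read it from there. This is only a sketch under assumptions: `stage_locally` is a hypothetical helper name, and it expects any readable binary file-like object, such as the one returned by `fs.open(...)` above.

```python
import os
import shutil
import tempfile


def stage_locally(remote_file, suffix=".zip"):
    """Copy a readable binary file-like object to a local temp file; return its path."""
    fd, local_path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as local_file:
        # Stream in chunks so the archive is never fully loaded into memory.
        shutil.copyfileobj(remote_file, local_file)
    return local_path


# Hypothetical usage with the `fs` connection from the answer:
# with fs.open('hdfs://my_hdfs_directory/my_zipped_shapefile.zip') as remote:
#     local_zip = stage_locally(remote)
# gdf = gpd.read_file('zip://' + local_zip)
# os.remove(local_zip)  # clean up the temporary copy
```

The trade-off is some local disk space in exchange for being able to use the regular read_file() call, which also picks up the CRS from the .prj file when one is present in the archive.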