Tags: python, pyspark, geospark

ClassNotFoundException geosparksql.UDT.GeometryUDT


I have been trying to convert a GeoPandas DataFrame to a PySpark DataFrame without success. Currently, I have extended the DataFrame class to convert a GeoPandas DataFrame to a Spark DataFrame as follows:

# Install GeoSpark in the Colab runtime before importing it
!pip install geospark

from pyspark.sql import DataFrame
from pyspark.sql.types import (IntegerType, StringType, FloatType, BooleanType,
                               DateType, TimestampType, StructField, StructType)
from geospark.sql.types import GeometryType

class SPandas(DataFrame):
  def __init__(self, sqlC, objgpd):
    # Map each pandas/GeoPandas dtype to its Spark SQL equivalent
    esquema = dict(objgpd.dtypes)
    equivalencias = {'int64' : IntegerType, 'object' : StringType, 'float64' : FloatType, 
                     'bool' : BooleanType, 'datetime64' : DateType,
                     'timedelta' : TimestampType, 'geometry' : GeometryType}

    for clave, valor in esquema.items():
      try:
        esquema[clave] = equivalencias[str(valor)]
      except KeyError:
        # Fall back to StringType for any dtype without a mapping
        esquema[clave] = StringType

    # Build the Spark schema and create the DataFrame from the GeoPandas object
    esquema = StructType([ StructField(v, esquema[v](), False) for v in esquema.keys() ])
    datos = sqlC.createDataFrame(objgpd, schema=esquema)
    super(self.__class__, self).__init__(datos._jdf, datos.sql_ctx)

The preceding code runs without error, but when I try to take an item from the DataFrame, I get the following error:

import geopandas as gpd

fp = "Paralela/Barrios/Barrios.shp"
map_df = gpd.read_file(fp)
mapa_sp = SPandas(sqlC, map_df)
mapa_sp.take(1)

Py4JJavaError: An error occurred while calling o21.applySchemaToPythonRDD.
: java.lang.ClassNotFoundException: org.apache.spark.sql.geosparksql.UDT.GeometryUDT

The problem is with the 'geometry' column of the GeoPandas DataFrame, as everything works flawlessly without it. The 'geometry' column holds Shapely Polygon objects, which should be recognized by GeoSpark's GeometryType class.

Is there any way to install org.apache.spark.sql.geosparksql.UDT.GeometryUDT? I'm using Google Colab.


Solution

  • You need to include the geospark dependency in your project and add the jar to your runtime environment's classpath. The version of the jar below is compatible with spark-core_2.11:2.3.0:

    <dependency>
        <groupId>org.datasyslab</groupId>
        <artifactId>geospark</artifactId>
        <version>1.3.1</version>
        <scope>provided</scope>
    </dependency>
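
    Since the question uses PySpark on Google Colab rather than a Maven project, a minimal sketch of the equivalent setup is shown below, assuming the geospark 1.3.x Python package (the names upload_jars and GeoSparkRegistrator come from its geospark.register module):

    from pyspark.sql import SparkSession
    from geospark.register import upload_jars, GeoSparkRegistrator

    # Copy the GeoSpark jars bundled with the Python package onto the
    # Spark classpath (uses findspark under the hood).
    upload_jars()

    spark = SparkSession.builder \
        .appName("geospark-colab") \
        .getOrCreate()

    # Register GeoSpark's SQL types and functions, including
    # org.apache.spark.sql.geosparksql.UDT.GeometryUDT.
    GeoSparkRegistrator.registerAll(spark)

    With the jars on the classpath and the types registered, the GeometryType column should serialize correctly, and the SPandas conversion above should no longer raise ClassNotFoundException.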