I have written a UDF in PySpark and call it as shown below:
df1 = df.where(point_inside_polygon(latitide,longitude,polygonArr))
Here df1 and df are Spark DataFrames.
The function is given below:
import math
from shapely.geometry import MultiPoint, Point

def point_inside_polygon(x, y, poly):
    latt = float(x)
    long = float(y)
    # Rows with a NaN coordinate can never be inside the polygon
    if math.isnan(latt) or math.isnan(long):
        return False
    point = Point(latt, long)
    # Treat the convex hull of the given points as the polygon
    polygon = MultiPoint(poly).convex_hull
    return bool(polygon.contains(point))
But when I check the data type of latitude and longitude inside the function, each one is a Column object rather than a number.
Is there a way to work on the value in each row instead of on the Column object? I don't want to use a for loop because I have a huge record set, and that would defeat the purpose of using Spark.
Is there a way to pass the column values as floats, or to convert them inside the function?
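Wrap the function with udf and call the wrapped version, not the original. Spark then calls the function once per row with plain Python values rather than Column objects. A plain Python list cannot be passed as a column argument, so bind polygonArr in a lambda:
from pyspark.sql.types import BooleanType
from pyspark.sql.functions import udf
point_inside_polygon_ = udf(lambda x, y: point_inside_polygon(x, y, polygonArr), BooleanType())
df1 = df.where(point_inside_polygon_(latitide, longitude))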
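To see what the wrapped function has to cope with on each row, here is a minimal sketch of a row-level predicate that runs without Spark or Shapely. It rests on several assumptions that are not in the original post: the helper name make_point_in_polygon is hypothetical, the polygon is given as an ordered list of (x, y) vertices, and a standard-library ray-casting test stands in for Shapely's convex_hull and contains (the two agree only when the polygon is convex). It also returns False for None, since Spark passes SQL NULL to a Python UDF as None and float(None) would raise.

```python
import math


def make_point_in_polygon(poly):
    """Return a two-argument predicate with the polygon vertices bound in.

    poly is a list of (x, y) vertex tuples, ordered around the polygon.
    (Hypothetical helper, not part of the original post.)
    """
    vertices = [(float(px), float(py)) for px, py in poly]

    def point_in_polygon(x, y):
        # Spark hands SQL NULL to a Python UDF as None
        if x is None or y is None:
            return False
        x, y = float(x), float(y)
        if math.isnan(x) or math.isnan(y):
            return False
        inside = False
        n = len(vertices)
        for i in range(n):
            x1, y1 = vertices[i]
            x2, y2 = vertices[(i + 1) % n]
            # Does this edge straddle the horizontal line through y?
            if (y1 > y) != (y2 > y):
                # x-coordinate where the edge crosses that line
                x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                if x < x_cross:
                    inside = not inside
        return inside

    return point_in_polygon


# Plain-Python check on a 10 x 10 square
square = [(0, 0), (0, 10), (10, 10), (10, 0)]
check = make_point_in_polygon(square)
print(check(5, 5))             # True: inside the square
print(check(15, 5))            # False: outside the square
print(check(None, 5))          # False: NULL coordinate
print(check(float("nan"), 5))  # False: NaN coordinate
```

Because the polygon is bound when the predicate is created, the returned function takes only the two coordinate values. Under the same assumptions it could be wrapped as udf(make_point_in_polygon(polygonArr), BooleanType()) and applied to the two coordinate columns inside df.where.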