Search code examples
pythonpysparkzoneinfo

Python ZoneInfo not working in pyspark UDF


Hi Experts I am stuck with an issue where if I use the ZoneInfo from python zoneinfo package the dates are getting converted as expected, but, when I am using the same code under a spark udf it is throwing error as "No timezone found with key Europe/Brussels". Please help me out here. Below is the code :

Working python code :

from zoneinfo import ZoneInfo
conv_dt = datetime.strptime('2038.01.09 00:00:00', '%Y.%m.%d %H%M%S').astimezone(ZoneInfo('Europe/Brussels'))

Same code under SPARK UDF NOT working dataframe creation :

sample_df = spark.createDataFrame(['2038.01.09 00:00:00'], StringType()).toDF('sampledate')

udf declaration :

def test_udf(sd):
    return datetime.strptime(sd, '%Y.%m.%d %H%M%S').astimezone(ZoneInfo('Europe/Brussels'))

calling udf :

x = udf(test_udf, TimeStamp())
cast_df = sample_df.withColumns('sampledate',x(sample_df['sampledate']))
cast_df.show()

Error : File "...lib/python3.9/zoneinfo/_common.py" ,line 24 in load_tzdata raise

ZoneInfoNotFoundError:"No timezone found with key Europe/Brussels"

Thanks !


Solution

  • The issue you are facing is likely related to the way Spark UDFs work with external Python packages like zoneinfo. When you use zoneinfo in your Spark UDF, it needs to be available to the Spark workers. For troubleshooting, here are some steps to resolve the issue:

    1. Ensure that the zoneinfo package is installed on all Spark worker nodes. You can use a tool like pip or a package manager to install it. You may need administrative access to install packages on worker nodes.

    2. Make sure that the zoneinfo package is importable within your Spark UDF. To do this, you can include the import statement for ZoneInfo at the beginning of your UDF function.

    You can do this by as follows:

    from pyspark.sql import SparkSession
    
    from pyspark.sql.functions import udf
    
    from pyspark.sql.types import StringType, TimestampType
    
    from zoneinfo import ZoneInfo
    
    from datetime import datetime
    
    spark = SparkSession.builder.appName("ZoneInfoExample").getOrCreate()
    
    def test_udf(sd):
        from zoneinfo import ZoneInfo  # Import ZoneInfo within the UDF
        return datetime.strptime(sd, '%Y.%m.%d %H%M%S').astimezone(ZoneInfo('Europe/Brussels'))
    
    x = udf(test_udf, TimestampType())
    
    sample_df = spark.createDataFrame(['2038.01.09 00:00:00'], StringType()).toDF('sampledate')
    
    cast_df = sample_df.withColumn('sampledate', x(sample_df['sampledate']))
    cast_df.show()