
Pyspark - Python Set Same Timezone


I am reading some Parquet files whose timestamps are in timezone GMT-4:

from pyspark.sql import SparkSession


def get_spark():
    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
    spark.conf.set("spark.sql.session.timeZone", "GMT-4")
    return spark

Showing the file:

base_so.where(base_so.ID_NUM_CLIENTE == 2273).show()

+--------------+-----------+----------------+------------------+-------------------+-------------------+-------------------+
|ID_NUM_CLIENTE|NUM_TRAMITE|      COD_TIPO_1|        COD_TIPO_2|      FECHA_TRAMITE|      FECHA_INGRESO|  FECHA_INICIO_PAGO|
+--------------+-----------+----------------+------------------+-------------------+-------------------+-------------------+
|          2273|     238171|               X|               NN |2005-10-25 00:00:00|2005-10-25 09:26:54|1995-05-03 00:00:00|
|          2273|     238171|               X|               NMP|2005-10-25 00:00:00|2005-10-25 09:26:54|1995-05-03 00:00:00|
+--------------+-----------+----------------+------------------+-------------------+-------------------+-------------------+


When I create a DataFrame in a test, the timestamp column does not keep the same date:

from datetime import datetime
from decimal import Decimal

from pyspark.sql.types import (
    DecimalType,
    StringType,
    StructField,
    StructType,
    TimestampType,
)

spark = get_spark()
df_busqueda = spark.createDataFrame(
    data=[
        [Decimal(2273), Decimal(238171), "SO", datetime.strptime('2005-10-25 00:00:00', '%Y-%m-%d %H:%M:%S')],
    ],
    schema=StructType(
        [
            StructField('ID_NUM_CLIENTE', DecimalType(), True),
            StructField('NUM_TRAMITE', DecimalType(), True),
            StructField('COD_TIPO_1', StringType(), True),
            StructField('FECHA_TRAMITE', TimestampType(), True),
        ]
    ),
)
df_busqueda.show()

+--------------+-----------+----------------+-------------------+
|ID_NUM_CLIENTE|NUM_TRAMITE|      COD_TIPO_1|      FECHA_TRAMITE|
+--------------+-----------+----------------+-------------------+
|          2273|     238171|              SO|2005-10-24 23:00:00|
+--------------+-----------+----------------+-------------------+


How can I configure things so that both the Parquet data and the DataFrames I create keep the same timezone?


Solution

  • You can set the timezone in the Spark session.

    Example:

    For Spark 3.0+:

    spark.sql("SET TIME ZONE 'America/New_York'").show()
    #+--------------------------+----------------+
    #|key                       |value           |
    #+--------------------------+----------------+
    #|spark.sql.session.timeZone|America/New_York|
    #+--------------------------+----------------+


    spark.sql("select current_timestamp()").show()
    #+--------------------------+
    #|current_timestamp()       |
    #+--------------------------+
    #|2021-08-25 16:23:16.096459|
    #+--------------------------+

    

    For Spark < 3.0:

    spark.conf.set("spark.sql.session.timeZone", "UTC")
    spark.sql("select current_timestamp()").show()
    #+--------------------+
    #| current_timestamp()|
    #+--------------------+
    #|2021-08-25 20:26:...|
    #+--------------------+
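    As for why the test DataFrame shows `2005-10-24 23:00:00`: `datetime.strptime` returns a naive datetime, which PySpark converts using the driver's OS timezone, while `show()` renders the result in the session timezone (`GMT-4`). A minimal sketch of that shift using only the standard library, under the assumption that the driver's OS timezone is GMT-3 (note the POSIX `Etc/GMT±N` names have inverted signs, so `Etc/GMT+3` means UTC-3):

    ```python
    from datetime import datetime
    from zoneinfo import ZoneInfo  # Python 3.9+

    naive = datetime.strptime("2005-10-25 00:00:00", "%Y-%m-%d %H:%M:%S")

    # The naive value is interpreted in the driver's local timezone
    # (assumed GMT-3 here)...
    as_driver_local = naive.replace(tzinfo=ZoneInfo("Etc/GMT+3"))

    # ...but rendered in the session timezone (GMT-4), losing one hour:
    rendered = as_driver_local.astimezone(ZoneInfo("Etc/GMT+4"))
    print(rendered.strftime("%Y-%m-%d %H:%M:%S"))  # 2005-10-24 23:00:00

    # One workaround: pass a timezone-aware datetime so the wall-clock value
    # no longer depends on the driver's OS timezone:
    aware = naive.replace(tzinfo=ZoneInfo("Etc/GMT+4"))
    print(aware.astimezone(ZoneInfo("Etc/GMT+4")).strftime("%Y-%m-%d %H:%M:%S"))
    # 2005-10-25 00:00:00
    ```

    With an aware datetime in `createDataFrame`, both the Parquet reads and the test data should display the same wall-clock value once `spark.sql.session.timeZone` is set to `GMT-4`.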