Tags: python, apache-spark, pyspark

Converting values from PySpark Row datetime.date to yyyy-mm-dd


I am fetching data from a table; the query returns a list of Row objects containing datetime.date values. I would like to have them as a list of varchar/string values instead.

query = "select device_date from device where device is not null"
res = spark.sql(query).collect()

if len(res) != 0:
    return res[:20]

The returned value is in this format:

[Row(device_date=datetime.date(2019, 9, 25)), Row(device_date=datetime.date(2019, 9, 17)), Row(device_date=datetime.date(2020, 1, 8))]

I would like to have the following output returned instead:

['2019-09-25','2019-09-17','2020-01-08']

Please advise.
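
For reference, here is a minimal way to reproduce the situation with a toy device table (the device IDs and dates below are invented for illustration):

from datetime import date
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy stand-in for the real table
df = spark.createDataFrame(
    [("d1", date(2019, 9, 25)), ("d2", date(2019, 9, 17)), ("d3", date(2020, 1, 8))],
    ["device", "device_date"],
)
df.createOrReplaceTempView("device")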


Solution

  • Are you sure you want to collect your data and then have to process it in Python?

    With df = spark.sql(query), depending on the answer:

    YES (python solution)

    import datetime
    
    out = df.collect()
    
    # format each date explicitly with strftime
    list(map(lambda x: datetime.datetime.strftime(x.device_date, "%Y-%m-%d"), out))
    
    ['2019-09-25', '2019-09-17', '2020-01-08']
    
    # OR: str() on a datetime.date already yields yyyy-mm-dd
    list(map(str, (x.device_date for x in out)))
    ['2019-09-25', '2019-09-17', '2020-01-08']
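
    The second variant works because str() on a datetime.date is defined to produce the ISO yyyy-mm-dd form; an equivalent, arguably clearer spelling is isoformat() in a list comprehension:

    # same result: date.isoformat() is exactly yyyy-mm-dd
    [x.device_date.isoformat() for x in out]
    ['2019-09-25', '2019-09-17', '2020-01-08']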
    

    NO (Spark solution)

    from pyspark.sql import functions as F
    
    df.select(F.date_format("device_date", "yyyy-MM-dd").alias("device_date")).collect()
    
    [Row(device_date='2019-09-25'),
     Row(device_date='2019-09-17'),
     Row(device_date='2020-01-08')]
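
    This collect still returns Row objects. To get the flat list of strings the question asks for, unpack them (a small sketch reusing the same select):

    rows = df.select(F.date_format("device_date", "yyyy-MM-dd").alias("device_date")).collect()
    [r.device_date for r in rows]
    ['2019-09-25', '2019-09-17', '2020-01-08']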
    

    The Spark version can also be done directly in SQL:

    query = "select date_format(device_date, 'yyyy-MM-dd') as date_format from device"
    
    spark.sql(query).collect()
    
    [Row(date_format='2019-09-25'),
     Row(date_format='2019-09-17'),
     Row(date_format='2020-01-08')]
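
    As with the DataFrame version, these Rows can be unpacked into plain strings, and the question's res[:20] cap can be pushed into the query itself so that only 20 rows are ever collected (a sketch keeping the question's "where device is not null" filter):

    query = """
        select date_format(device_date, 'yyyy-MM-dd') as device_date
        from device
        where device is not null
        limit 20
    """
    [r.device_date for r in spark.sql(query).collect()]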