pyspark, data-structures, datetime-format

PySpark StructField: StringType or TimestampType for a date column?


I have a schema (a StructType made of StructFields) for a PySpark DataFrame that includes a date column (example value: 2023-10-05). Should this column use StringType or TimestampType? As far as I can tell, StructField only supports StringType or TimestampType, and nothing like DateType.

from pyspark.sql.types import StructField, StringType, TimestampType

new_schema = [StructField("item_id", StringType(), True),
              StructField("date", TimestampType(), True),
              StructField("description", StringType(), True)]

I would prefer to use a string for the date rather than a timestamp, for the reasons below.

1. TimestampType is mostly used for data such as streaming events, where second or millisecond precision matters because people care about real time. In our case we only need the date, so a string is enough.

2. From a consistency perspective, a string is more stable than a timestamp when the data is transferred between systems.

3. From a casting perspective, string to date feels like a downcast and timestamp to date feels like an upcast, so casting a string to a date seems safer than casting a timestamp to a date.

I am not sure whether these points are valid and would appreciate your opinion.

Also, does anyone know why PySpark's StructField seems to offer only StringType or TimestampType, but no DateType?


Solution

  • It turns out that DateType is available and can be used here.

    from pyspark.sql.types import StructType, StructField, StringType, DateType
    

    So you can use DateType for the date column.