Tags: scala, apache-spark, nscala-time

How to add a new column with the day of week based on another column in a DataFrame?


I have a field in a data frame currently formatted as a string (mm/dd/yyyy), and I want to create a new column in that data frame with the day-of-week name (e.g. Thursday) for that field. I've imported

import com.github.nscala_time.time.Imports._

but am not sure where to go from here.


Solution

  • Create a formatter:

    val fmt = DateTimeFormat.forPattern("MM/dd/yyyy")
    

    Parse the date:

    val dt = fmt.parseDateTime("09/11/2015")
    

    Get the day of the week:

    dt.toString("EEEEE")  // "Friday" (full day-of-week name)
    

    Wrap it using org.apache.spark.sql.functions.udf and you have a complete solution (a sketch of that approach is shown after the example below). Still, there is no need for that, since HiveContext already provides all the required UDFs:

    import sqlContext.implicits._  // needed for toDF and the $"col" syntax

    val df = sc.parallelize(Seq(
       Tuple1("08/11/2015"), Tuple1("09/11/2015"), Tuple1("09/12/2015")
    )).toDF("date_string")
    
    df.registerTempTable("df")
    
    sqlContext.sql(
      """SELECT date_string,
            from_unixtime(unix_timestamp(date_string,'MM/dd/yyyy'), 'EEEEE') AS dow
          FROM df"""
    ).show
    
    // +-----------+--------+
    // |date_string|     dow|
    // +-----------+--------+
    // | 08/11/2015| Tuesday|
    // | 09/11/2015|  Friday|
    // | 09/12/2015|Saturday|
    // +-----------+--------+
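
    If you do want the UDF route with nscala-time / Joda-Time, a minimal sketch might look like this (the dayOfWeek name is just for illustration, and the formatter is created per call to keep the example simple):

    import org.apache.spark.sql.functions.udf
    import org.joda.time.format.DateTimeFormat

    // Hypothetical UDF wrapping the parse-and-format steps shown above
    val dayOfWeek = udf((s: String) =>
      DateTimeFormat.forPattern("MM/dd/yyyy").parseDateTime(s).toString("EEEEE"))

    df.withColumn("dow", dayOfWeek($"date_string")).show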
    

    EDIT:

    Since Spark 1.5 you can use the from_unixtime and unix_timestamp functions directly:

    import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
    
    df.select(from_unixtime(
      unix_timestamp($"date_string", "MM/dd/yyyy"), "EEEEE").alias("dow"))