I am trying to apply a function to one of my DataFrame columns to convert its values. The values in the column look like "20160907", and I need them to be "2016-09-07".
I wrote a function like this:
def convertDate(inDate: String): String = {
  val year = inDate.substring(0, 4)
  val month = inDate.substring(4, 6)
  val day = inDate.substring(6, 8)
  return year + '-' + month + '-' + day
}
In my Spark Scala code, I use it like this:
def final_Val {
  val oneDF = hiveContext.read.orc("/tmp/new_file.txt")
  // Wrap the conversion function defined above as a UDF
  val convertToDate_udf = udf(convertDate _)
  val convertedDf = oneDF.withColumn("modifiedDate", convertToDate_udf(col("EXP_DATE")))
  convertedDf.show()
}
Surprisingly, I can run this in the Spark shell without any error. In Scala IDE, I get the compilation error below:
Multiple markers at this line:
not enough arguments for method udf: (implicit evidence$2: reflect.runtime.universe.TypeTag[String], implicit evidence$3: reflect.runtime.universe.TypeTag[String])org.apache.spark.sql.UserDefinedFunction. Unspecified value parameters evidence$2, evidence$3.
I am using Spark 1.6.2 and Scala 2.10.5.
Can someone please tell me what I am doing wrong here?
I tried the same approach with different functions, as in this post: stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column, and I do not get any compilation errors with that code. I cannot figure out what is wrong with mine.
From what I learned in a Spark Summit course, you should use the sql.functions methods as much as possible. Before implementing your own UDF, check whether an existing function in the sql.functions package already does the same work. With the existing functions, Spark can do a lot of optimizations for you, and it is not obliged to serialize and deserialize your data to and from JVM objects.
To achieve the result you want, I propose this solution:
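import org.apache.spark.sql.functions.{from_unixtime, unix_timestamp}
// Needed for toDF and the $"..." syntax. On Spark 1.6, use sqlContext.implicits._ and sc.parallelize instead.
import spark.implicits._
val oneDF = spark.sparkContext.parallelize(Seq("19931001", "19931001")).toDF("EXP_DATE")
val convertedDF = oneDF.withColumn("modifiedDate", from_unixtime(unix_timestamp($"EXP_DATE", "yyyyMMdd"), "yyyy-MM-dd"))
convertedDF.show()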
This gives the following result:
+--------+------------+
|EXP_DATE|modifiedDate|
+--------+------------+
|19931001|  1993-10-01|
|19931001|  1993-10-01|
+--------+------------+
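If you still need a UDF at some point (for logic that has no built-in equivalent), it may help to make the conversion function tolerant of bad input, since substring throws on short or null values. Here is a minimal sketch in plain Scala. It assumes Java 8 or later for java.time, and the object name DateConversion is only illustrative, not something from your code:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

object DateConversion {
  // DateTimeFormatter instances are immutable and thread-safe, so they can be shared.
  private val inputFormat = DateTimeFormatter.ofPattern("yyyyMMdd")
  private val outputFormat = DateTimeFormatter.ISO_LOCAL_DATE // formats as yyyy-MM-dd

  // Returns Some("2016-09-07") for "20160907", and None when the input
  // is null or cannot be parsed as a yyyyMMdd date.
  def convertDate(inDate: String): Option[String] =
    Try(LocalDate.parse(inDate, inputFormat).format(outputFormat)).toOption
}
```

For example, DateConversion.convertDate("20160907") returns Some("2016-09-07"), while an input such as "not-a-date" returns None instead of throwing. The built-in functions above are still the better choice when they cover your case.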
Hope this helps. Best regards