I have Parquet data with a TIMESTAMP column which, when read using Spark, works perfectly fine. Below are the sample records:
scala> data.select($"BATCH_LOAD_DATE").limit(5).show()
+-------------------+
| BATCH_LOAD_DATE|
+-------------------+
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
|2020-02-21 07:35:27|
+-------------------+
I checked the schema and it shows TIMESTAMP as the data type for the column, which is correct, so there is no issue with the schema either. However, when I try to read the same TIMESTAMP column via Hive, it throws the exception below:
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritableV2 (state=,code=0)
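For reference, a read as simple as the following projection is enough to trigger the error in Hive (the table name here is a placeholder):
select BATCH_LOAD_DATE from my_parquet_table limit 5;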
I can see from this link that it is an open issue in Hive, and I am not sure whether it has been resolved yet. Is there any workaround for this, either something that can be done while loading the data or some transformation once it is loaded?
I figured out an alternative solution to my own problem. I changed the column type of the TIMESTAMP column to STRING, and while fetching the data I used the from_unixtime function to cast that particular column to the intended date format, which let me fetch it.
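For context, this is a minimal sketch of what the table definition looks like with that column declared as STRING; it assumes an external Hive table over the Parquet files, and the table name, column list and location are placeholders:
create external table my_table (
  SOURCE_LOAD_DATE string
  -- other columns ...
)
stored as parquet
location '/path/to/parquet/data';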
But the problem here was that if my date value was 2020-02-27 15:40:22, fetching that column via Hive returned the raw epoch value in microseconds, i.e. 15340232000000.
So I solved this problem in Hive with the query below:
select *, from_unixtime(cast(SOURCE_LOAD_DATE as BIGINT) DIV 1000000) as SOURCE_LOAD_DATE from table_name;
Using the above query I was able to get the proper date and timestamp value.
Note: You will need to cast every column which has timestamp data.
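For example, with two such columns (column names taken from this post; the table name is a placeholder), the query becomes:
select
  from_unixtime(cast(SOURCE_LOAD_DATE as BIGINT) DIV 1000000) as SOURCE_LOAD_DATE,
  from_unixtime(cast(BATCH_LOAD_DATE as BIGINT) DIV 1000000) as BATCH_LOAD_DATE
from table_name;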
This is the only trick which I could think of. I hope this might help you or others!