
Load Date Column from raw data with '/' separator in sparkSQL


I have columns of type DATE in Spark SQL, e.g.
CREATE TABLE ABC(startDate DATE, EndDate DATE....
and I load the data with
LOAD DATA INPATH './input/user.txt' INTO TABLE ABC

In user.txt data is like

2016/06/12 2016/06/15 
2016/06/12 2016/06/15

but it loads data as

null null
null null

if it's

2016-06-12 2016-06-15 
2016-06-12 2016-06-15

then the data loads correctly.
How can I handle dates whose separator is '/'?
I don't want to replace the separator in the input file.
Please help me. Thanks.


Solution

  • I faced this issue before in Hive and found a workaround. First, load the columns as strings instead of the DATE type.

    ex:

    CREATE TABLE ABC(startDate string, EndDate string....)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '  -- the sample data is space-separated
    STORED AS TEXTFILE
    LOCATION './input/';  -- LOCATION must point to a directory, not a file
    

    Then I used string functions to extract the year/month/day from those fields. For example:

    select substr(startDate,1,4) as year, substr(startDate,6,2) as month .... from ABC
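
If a true DATE value is needed, the extracted parts can also be stitched back together with '-' separators and cast. A sketch against the ABC string table above (column name taken from the CREATE TABLE):

```sql
-- Rebuild a 'yyyy/MM/dd' string as a proper DATE value
SELECT cast(concat(substr(startDate, 1, 4), '-',
                   substr(startDate, 6, 2), '-',
                   substr(startDate, 9, 2)) AS date) AS startDate
FROM ABC;
```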
    

    Another way is to replace the '/' with '-' and then cast the result to the DATE type, so the usual date functions apply. For example:

    select regexp_replace(startDate,'/','-') from ABC
    select cast(regexp_replace(startDate,'/','-') as date) as startDate from ABC
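
There is also a way to parse the slash format directly without touching the string: unix_timestamp accepts a format pattern, and the result can be converted back to a date. A sketch under the same assumptions (works in both Hive and Spark SQL):

```sql
-- Parse 'yyyy/MM/dd' directly instead of rewriting the separator
SELECT to_date(from_unixtime(unix_timestamp(startDate, 'yyyy/MM/dd'))) AS startDate
FROM ABC;
```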
    

    All of the above is how to achieve it in Hive. In Spark the approach is the same: first load the values as strings into a DataFrame.

    // assumes: import spark.implicits._ and
    // import org.apache.spark.sql.functions.regexp_replace
    val s1 = Seq(("2016/06/12", "2016/06/15"), ("2016/06/12", "2016/06/15")).toDF("x", "y")
    val result = s1.select(
      regexp_replace($"x", "/", "-").as("startDate"),
      regexp_replace($"y", "/", "-").as("EndDate"))
    result.show()

    +----------+----------+
    | startDate|   EndDate|
    +----------+----------+
    |2016-06-12|2016-06-15|
    |2016-06-12|2016-06-15|
    +----------+----------+
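
Since Spark 2.2 the replacement step can be skipped entirely: to_date accepts a format pattern, so the slash-separated strings parse straight into DateType columns. A minimal sketch, assuming the same s1 DataFrame and an active SparkSession in scope:

```scala
import org.apache.spark.sql.functions.to_date

// Parse the slash-separated strings directly into DATE columns
val dates = s1.select(
  to_date($"x", "yyyy/MM/dd").as("startDate"),
  to_date($"y", "yyyy/MM/dd").as("EndDate"))
dates.printSchema()  // both columns are now DateType
```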
    

    Hope this helps.