Search code examples
sqlhivetimestamphiveqlhue

SQL; Split a value up using sql in hive/hue


needing some advice on splitting a number into a date timestamp, currently using Hue to query the hive db;

In a table I have a column that is used to capture a unique ref for a record. The value looks like this;

219872021081000741

Contained within this is a date and time, I'm looking to extract (using sql) the date/time from this and have it as a column of its own. Here is the breakdown of the number:

Based on the bold values from left to right is DD YYYY MM HHMM

21 987 2021 08 1000 741

regex [0-3]?[0-9]{1}$ref[2][0-9][0-9][0-9][0-1][0-9][0-2][0-9][0-5][0-9][0-9]{3}_"

Using sql, I want to assess the number then create a column that then formats it to DD-MM-YY HHMM as timestamp. Have reviewed some posts, and trying out a few things, but not having much luck. The other sticking point is the DD will not always be 2 values eg, if it was the 1st then it will be 1 not 01.

Trying to incorporate into the below. Thanks in advance for any advice.

select *,
cast((UTC +(60*60*12)*1000)/1000 as TIMESTAMP) as `LocalTime`
from Table.Name
where 
   name rlike 'FieldValue.*'

UPDATE: In a roundabout way I updated the sql to do a count of the value.

  • If it has 17 digits, then i know the day is anywhere from the 1st-9th so I tag it as 17.
  • If it has 18 digits, then I know the day is anywhere from the 10h-endofmonth
  • From here i use substring to return the day components, which I'll bring into a single field via concat or something along those lines.**

Here is the update sql, just need to figure out/get some guidance on how I now determine how to use the new column FieldCount eg it is 17, then substring(FieldValue ,1,1) given its anything from the 1st-9th. If its 18, then substring(FieldValue ,1,2) given its anything from the 10th up.

select *,
    cast((utc+(60*60*12)*1000)/1000 as TIMESTAMP) as `LocalTime`,
    case
    when FieldValue REGEXP '^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$' then '17'
    when FieldValue REGEXP '^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$' then '17'
    end FieldCount,
    substring(FieldValue ,6,4) as Years,
    substring(FieldValue ,1,1) as Days,
    substring(FieldValue ,10,2) as Months,
    substring(FieldValue ,12,2) as Hours,
    substring(FieldValue ,14,2) as Minut 
    from table.name

New Update, I changed this now to separate based on case condition. This basically separates out the value into separate fields. Any ideas to concat based on alias field names?

select
AField,
cast((UTC+(60*60*12)*1000)/1000 as TIMESTAMP) as `LocalTime`,
case when length(AField) = 18 then substring(AField,1,2)  else substring(AField,1,1) end Days,
case when length(AField) = 18 then substring(AField,10,2) else substring(AField,9,2) end Months,
case when length(AField) = 18 then substring(AField,6,4) else substring(AField,5,4) end years,
case when length(AField) = 18 then substring(AField,12,2) else substring(caseid,11,2) end Hours,
case when length(AField) = 18 then substring(AField,14,2) else  substring(AField,13,2) end minutes
from table.name

Solution

  • Correct timestamp string representation in Hive is yyyy-MM-dd HH:mm:ss.S. You do not need to extract all parts separately, then concat to get timestamps. Using regexp_replace you can build correct timestamp using backreferences to capturing groups (in round brackets) in the regexp.

    with mytable as(--test dataset, use your table instead
    select stack(2,
    '219872021081000741',
    '19872021081000741'
    ) as AField
    )
    
    select
    case when length(AField) = 18 
         then timestamp(regexp_replace(AField,'^(\\d{2})\\d{3}(\\d{4})(\\d{2})(\\d{2})(\\d{2})\\d{3}$','$2-$3-$1 $4:$5:00.0'))
         else timestamp(regexp_replace(AField,'^(\\d)\\d{3}(\\d{4})(\\d{2})(\\d{2})(\\d{2})\\d{3}$','$2-$3-0$1 $4:$5:00.0'))
    end as result     
    from mytable
    

    Result:

    result  
    2021-08-21 10:00:00.0
    2021-08-01 10:00:00.0
    

    Note: timestamp() construct here is to demonstrate that string produced is compatible with timestamp data type and is being cast correctly, you can keep it as string if you prefer.