Search code examples
amazon-web-serviceshivehiveqlamazon-athena

How to transform data in Amazon Athena


I have some data in S3 location in json format. It have 4 columns val, time__stamp, name and type. I would like to create an external Athena table from this data with some transformations given below:

  1. timestamp: timestamp should be converted from unix epoch to UTC, this I did by using the timestamp data type.
  2. name: name should filtered with following sql logic:
    name not in ('abc','cdf','fgh') and name not like '%operator%'
  3. type: type should not have values labeled as counter
  4. I would like to add two partition columns date and hour which should be derived from time__stamp column

I started with following:

CREATE EXTERNAL TABLE `airflow_cluster_data`(
  `val` string COMMENT 'from deserializer', 
  `time__stamp` timestamp COMMENT 'from deserializer', 
  `name` string COMMENT 'from deserializer', 
  `type` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  date,
  hour)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'mapping.time_stamp'='@timestamp') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket1/raw/airflow_data'

I tried various things but couldn't figure out the syntax. Using spark could have been easier but I don't want to run Amazon EMR every hour for small data set. I prefer to do it in Athena if possible.

Please have a look at some sample data:

1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer


Solution

  • Probably the simplest method is to create a View:

    CREATE VIEW foo AS
    SELECT
      val,
      cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
      cast(from_unixtime(time__stamp / 1000) as date) as date,
      hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
      name,
      type
    FROM airflow_cluster_data
    WHERE name not in ('abc','cdf','fgh')
      AND name not like '%operator%'
      AND type != 'counter'