My logs look like this:
client_id;event_1;event_2;event3
And I would like to get an SQL table like this:
client_id | event
---------------------
... | event_1
... | event_2
... | event_3
I am new to Hive. It seems that one log line always yields exactly one row in the resulting table. I tried the following, without success:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
client_id String,
`event` String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\;]+);.*([^\;]+).*$" )
LOCATION 's3://myBucket/prefix/';
It takes only the first event and ignores the others.
Unfortunately, it is not possible to generate multiple rows using a SerDe in the table DDL. But you can achieve the same result in Hive in two steps:
(1) Read all user events as a single column:
CREATE EXTERNAL TABLE IF NOT EXISTS tablename (
client_id String,
events String
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\\;]+)\\;(.*)$" )
LOCATION 's3://myBucket/prefix/';
Check it: the table should expose two columns, client_id and all events concatenated:
'client_id'
and 'event_1;event_2;event3'
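A quick way to verify this (a sketch, using the table and column names from the DDL above) is to inspect a few raw rows:

```sql
-- Confirm the regex split each log line into client_id
-- and the concatenated events string
SELECT client_id, events
FROM tablename
LIMIT 10;
```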
(2) Split events and explode to generate rows:
select t.client_id, e.event
from tablename t
lateral view outer explode(split(t.events,'\\;')) e as event;
Read also about Lateral View.
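To see what split + explode does without touching the table, you can run it against a literal row (sample values only, for illustration):

```sql
-- split() turns 'event_1;event_2;event3' into an array,
-- explode() turns that array into one row per element
SELECT e.event
FROM (SELECT 'event_1;event_2;event3' AS events) t
LATERAL VIEW OUTER explode(split(t.events, '\\;')) e AS event;
-- expected: three rows, event_1 / event_2 / event3
```

OUTER keeps the row (with a NULL event) even when the events string is empty, which is usually what you want for log data.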
In Athena use UNNEST with CROSS JOIN:
select t.client_id, e.event
from tablename t
CROSS JOIN UNNEST(SPLIT(t.events,';')) AS e (event)
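The same can be checked in Athena with an inline VALUES row (again, sample values only):

```sql
-- Presto/Athena: SPLIT returns an array, UNNEST expands it into rows
SELECT e.event
FROM (VALUES ('event_1;event_2;event3')) AS t (events)
CROSS JOIN UNNEST(SPLIT(t.events, ';')) AS e (event);
```

Note that Presto's SPLIT takes a plain string delimiter, so the semicolon does not need the regex escaping that Hive's split requires.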