I have the following columns in hive:
id |my_list                               |
---+--------------------------------------+
52 |D.07.01.01=1;D.07.01.02=2;D.07.01.03=1|
I would like to create another table which contains all of the elements in the above list as columns:
id |D.07.01.01|D.07.01.02|D.07.01.03|
---+----------+----------+----------+
52 |LOW       |MODERATE  |LOW       |
I did the following:
SELECT
    id,
    CASE WHEN my_list rlike 'D.07.01.01'
         THEN CASE REGEXP_EXTRACT(my_list, '.*?D\\.07\\.01\\.01=(\\d)')
                  WHEN '1' THEN 'LOW'
                  WHEN '2' THEN 'MODERATE'
              END
    END AS `D.07.01.01`,
    CASE WHEN my_list rlike 'D.07.01.02'
         THEN CASE REGEXP_EXTRACT(my_list, '.*?D\\.07\\.01\\.02=(\\d)')
                  WHEN '1' THEN 'LOW'
                  WHEN '2' THEN 'MODERATE'
              END
    END AS `D.07.01.02`,
    CASE WHEN my_list rlike 'D.07.01.03'
         THEN CASE REGEXP_EXTRACT(my_list, '.*?D\\.07\\.01\\.03=(\\d)')
                  WHEN '1' THEN 'LOW'
                  WHEN '2' THEN 'MODERATE'
              END
    END AS `D.07.01.03`
...
In reality there are 30 values like that, not only 3. This works, but performance is very poor (about 2 hours for 50k rows). I am looking for a way to make it faster.
Also, is my regex format inside REGEXP_EXTRACT correct? I am having doubts about it.
Thank you
In your case, you execute the rlike SQL function 30 times and your regexp another 30 times for every row. You don't need to resort to regexps at all here.
You can first transform your list string into a map with the Spark SQL function str_to_map (available since Spark 2.0.1), and then look up each of your values in that map, as follows:
SELECT
id,
CASE my_map['D.07.01.01']
WHEN '1' THEN 'LOW'
WHEN '2' THEN 'MODERATE'
END AS `D.07.01.01`,
CASE my_map['D.07.01.02']
WHEN '1' THEN 'LOW'
WHEN '2' THEN 'MODERATE'
END AS `D.07.01.02`,
CASE my_map['D.07.01.03']
WHEN '1' THEN 'LOW'
WHEN '2' THEN 'MODERATE'
END AS `D.07.01.03`,
...
FROM (SELECT id, str_to_map(my_list, ';', '=') AS my_map FROM my_table) t
By doing so, you build the map only once per row and remove the calls to regexp and rlike entirely, which should make your query much faster.
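For reference, the per-row logic of the rewritten query can be sketched in plain Python: the dict construction below simulates what str_to_map(my_list, ';', '=') does, and the dict lookups simulate the CASE expressions. This is an illustration of the idea, not Spark code.

```python
# Map the stored codes to their labels, like the CASE ... WHEN branches.
LABELS = {"1": "LOW", "2": "MODERATE"}

def row_to_columns(row_id, my_list, keys):
    # Split the list string once into a dict, like str_to_map(my_list, ';', '=').
    my_map = dict(pair.split("=", 1) for pair in my_list.split(";"))
    # One cheap dict lookup per key, instead of one rlike + one regexp per key.
    out = {"id": row_id}
    for key in keys:
        out[key] = LABELS.get(my_map.get(key))
    return out

print(row_to_columns(52, "D.07.01.01=1;D.07.01.02=2;D.07.01.03=1",
                     ["D.07.01.01", "D.07.01.02", "D.07.01.03"]))
# {'id': 52, 'D.07.01.01': 'LOW', 'D.07.01.02': 'MODERATE', 'D.07.01.03': 'LOW'}
```

The key point is the same in both versions: the expensive parsing happens once per row, after which each of the 30 columns is a constant-time lookup.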