Search code examples
sqlregexapache-spark

Optimizing a very slow regexp_extract


I have the following columns in hive:

id     |my_list                               |
-------+--------------------------------------+
52     |D.07.01.01=1;D.07.01.02=2;D.07.01.03=1|

I would like to create another table which contains all of the elements in the above list as columns:

id     |D.07.01.01|D.07.01.02|D.07.01.03|
-------+--------------------------------+
52     |LOW       |MODERATE  |LOW       |

I did the following:

SELECT 
  id,
  CASE WHEN my_list rlike 'D.07.01.01' THEN CASE REGEXP_EXTRACT (my_list, '.*?D\\.07\\.01\\.01=(\\d)')
            WHEN '1' THEN 'LOW'
            WHEN '2' THEN 'MODERATE' 
            END
  END AS D.07.01.01,
  CASE WHEN my_list rlike 'D.07.01.02' THEN CASE REGEXP_EXTRACT (my_list, '.*?D\\.07\\.01\\.02=(\\d)')
            WHEN '1' THEN 'LOW'
            WHEN '2' THEN 'MODERATE' 
            END
  END AS D.07.01.02,
  CASE WHEN my_list rlike 'D.07.01.03' THEN CASE REGEXP_EXTRACT (my_list, '.*?D\\.07\\.01\\.03=(\\d)')
            WHEN '1' THEN 'LOW'
            WHEN '2' THEN 'MODERATE' 
            END
  END AS D.07.01.03
  ...

There are in reality 30 values like that not only 3. This works but with very very bad performances (2 hours for 50k rows). I am looking for a way to make it faster.

Is my REGEX format inside REGEXP_EXTRACT good? I am having doubts about it.

Thank you


Solution

  • In your case, you execute rlike SQL function 30 times and also your regexp 30 times for each line. You don't need to resort to regexp in this case.

    You can first transform your list as a string to a map using Spark SQL function str_to_map (exist since Spark 2.0.1) and then call all your different values from this map, as follows:

    SELECT
      id,
      CASE my_map['D.07.01.01']
        WHEN '1' THEN 'LOW'
        WHEN '2' THEN 'MODERATE'
      END AS `D.07.01.01`,
      CASE my_map['D.07.01.02']
        WHEN '1' THEN 'LOW'
        WHEN '2' THEN 'MODERATE'
      END AS `D.07.01.02`,
      CASE my_map['D.07.01.03']
        WHEN '1' THEN 'LOW'
        WHEN '2' THEN 'MODERATE'
      END AS `D.07.01.03`,
      ...
    FROM (SELECT id, str_to_map(my_list, ';', '=') AS my_map FROM my_table)
    

    By doing so, you will build map only once and remove totally calls to regexp and rlike, that should make your request faster