apache-spark · hive · apache-spark-sql · hiveql

Problem with explode in Hive or Spark query


There is a Hive table with ~500,000 rows. It has a single column that holds a JSON string. The JSON stores measurements from 15 devices, organized like this:

company_id=…
device_1:
   array of measurements
      each measurement has 2 attributes:
        value=
        date=

device_2:
 …
device_3:
 …
device_15:
 …

There are 15 devices in the JSON, and every device has a nested array of measurements inside. The size of the measurements array is not fixed.
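For illustration, a single row's JSON string might be shaped like this (a sketch reconstructed from the description above; the concrete values are placeholders):

{
  "company_id": "c-001",
  "device_1": {
    "measurements": [
      {"value": 1.5, "date": "2020-01-01"},
      {"value": 2.7, "date": "2020-01-02"}
    ]
  },
  "device_2": { "measurements": [ … ] },
  …
  "device_15": { "measurements": [ … ] }
}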

The goal is to extract, for each device, only the measurement with max(date).

The output of the SELECT should have the following columns:

company_id
device_1_value
device_1_date
...
device_15_value
device_15_date

I tried to use LATERAL VIEW to explode the measurements arrays:

SELECT get_json_object(json_string, '$.company_id'),
       d1.value, d1.date, ... d15.value, d15.date
FROM T
LATERAL VIEW explode(device_1.measurements) t1 AS d1
LATERAL VIEW explode(device_2.measurements) t2 AS d2
...
LATERAL VIEW explode(device_15.measurements) t15 AS d15

I can use the result of this SQL as input for another SQL that extracts the record with max(date) per device.
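For instance, that follow-up query could keep only the latest measurement per company with a window function. A minimal sketch for one device, assuming the result of the query above is available as exploded with columns company_id, d1_value, d1_date (names assumed), repeated per device:

SELECT company_id,
       d1_value AS device_1_value,
       d1_date  AS device_1_date
FROM (
  SELECT company_id, d1_value, d1_date,
         -- rank each company's device_1 measurements by date, newest first
         ROW_NUMBER() OVER (PARTITION BY company_id
                            ORDER BY d1_date DESC) AS rn
  FROM exploded
) t
WHERE rn = 1;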

My approach does not scale well: with 15 devices and 2 measurements per device, a single row of the input table generates 2^15 = 32,768 rows through the SQL above.

And there are 500,000 rows in the input table.


Solution

  • To avoid the Cartesian product generated by multiple lateral views, I split the original SQL into 15 independent SQLs (one per device), where each SQL has just one lateral view.

    Then I join all 15 SQLs on company_id, as sketched below.
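A sketch of the resulting query, shown for two of the fifteen devices (the remaining devices follow the same pattern; the aliases and the ROW_NUMBER trick are assumptions, not necessarily the exact SQL used):

WITH dev1 AS (
  -- one lateral view only: explode device_1's measurements
  SELECT get_json_object(json_string, '$.company_id') AS company_id,
         m.value AS device_1_value,
         m.date  AS device_1_date,
         ROW_NUMBER() OVER (
           PARTITION BY get_json_object(json_string, '$.company_id')
           ORDER BY m.date DESC) AS rn
  FROM T
  LATERAL VIEW explode(device_1.measurements) t AS m
),
dev2 AS (
  -- same shape for device_2
  SELECT get_json_object(json_string, '$.company_id') AS company_id,
         m.value AS device_2_value,
         m.date  AS device_2_date,
         ROW_NUMBER() OVER (
           PARTITION BY get_json_object(json_string, '$.company_id')
           ORDER BY m.date DESC) AS rn
  FROM T
  LATERAL VIEW explode(device_2.measurements) t AS m
)
SELECT d1.company_id,
       d1.device_1_value, d1.device_1_date,
       d2.device_2_value, d2.device_2_date
FROM (SELECT * FROM dev1 WHERE rn = 1) d1
JOIN (SELECT * FROM dev2 WHERE rn = 1) d2
  ON d1.company_id = d2.company_id;

Each branch now produces at most one exploded row per measurement (e.g. 2 per input row) instead of 2^15, and the join key is company_id. A LEFT JOIN can replace the JOIN if some device may have no measurements at all.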