Search code examples
apache-sparkapache-spark-sqlapache-zeppelin

How to explode a subset of data in Spark?


I have been trying to repeat the Zeppelin notebook Magellan example but using assets with geo-location information (assets DF) and trying to map them to zip-codes (zipcode DF). I got the zip-code shapefile from USGS and located it into Spark.

Here is what the asset DF looks like. It consists of an asset ID and a point on a map.

+---------+--------------------+
|    asset|               point|
+---------+--------------------+
|       10|Point(-96.7595319...|
|       11|Point(4.7115951, ...|

The zipcode DF is a built up from the USGS shapefile for the US zipcodes. This is were i use Magellan

val zipcodes = magellanContext.read.format("magellan").
load("magellan_us_states").
select($"polygon", $"metadata").
cache()

The results are the zipcode DF are

+--------------------+--------------------+
|             polygon|            metadata|
+--------------------+--------------------+
|Polygon(5, Wrappe...|Map(YEAR_ADM ->  ...|
|Polygon(5, Wrappe...|Map(YEAR_ADM ->  ...|

I then join the two DF together and do a query

val joined = zipcodes.
join(assets).
where($"point" within $"polygon").
select($"asset", explode($"metadata").as(Seq("k", "v"))).
withColumnRenamed("v", "state").
drop("k").
cache()

the results are as follows:

+--------+--------------------+
|  asset#|               state|
+--------+--------------------+
|10      |Arizona             |
|10      |                  48|
|10      |                1903|
|10      |                  04|
|10      |              23.753|
|10      |  February          |
|10      |                1912|
|10      |              28.931|
|10      |                  14|
|11      |North Carolina      |
...

The problem is that when i explode the metadata i only want the states. How do i explode that data so that i only end up with table that looks like this -

+--------+--------------------+
|  asset#|               state|
+--------+--------------------+
|10      |Arizona             |
|11      |North Carolina      |
|12      |Arizona             |
...

Solution

  • How do i explode that data so that i only end up with table that looks like this

    Simply don't use explode. Instead you simply select the field of interest:

    df.select($"asset",  $"metadata".getItem("state").alias("state"))