I am doing the following in spark sql:
spark.sql("""
SELECT
data.data.location.geometry.coordinates[0]
FROM df""")
This works fine, however I do not want to use raw SQL, I use dataframe API like so:
df.select("data.data.location.geometry.coordinates[0]")
Unfortunately this does not work:
AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "data.data.location.geometry.coordinates[0]" due to data type mismatch: Parameter 2 requires the "INTEGRAL" type, however "0" has the type "STRING".;
'Project [data#680.data.location.geometry.coordinates[0] AS 0#697]
+- Relation [data#680,id#681,idempotencykey#682,source#683,specversion#684,type#685] json
I know that I can use the F.col api and go with a getItem(0), but is there a built-in way to have the shortcut of getItem?
'.' is the shortcut of getField is there one for array slicing?
Thank you for your insight
TL;DR; use selectExpr()
df.selectExpr('data.data.location.geometry.coordinates[0]')
Looks like indexing inside an array using []
is considered an expression, where as data.data.location...
is considered a column name.
DataFrame.select() takes column name as arguments, so doesn't understand [0]
, what you want is DataFrame.selectExpr().
>>>
>>> df = spark.createDataFrame([({
... "data": {
... "location": {
... "geometry": {
... "coordinates": [41.84201, -89.485937]
... }
... }
... }
... },)]).withColumnRenamed('_1', 'data')
>>> df.show(truncate=False)
+-----------------------------------------------------------------------------+
|data |
+-----------------------------------------------------------------------------+
|{data -> {location -> {geometry -> {coordinates -> [41.84201, -89.485937]}}}}|
+-----------------------------------------------------------------------------+
>>> df.printSchema()
root
|-- data: map (nullable = true)
| |-- key: string
| | |-- key: string
| | |-- value: map (valueContainsNull = true)
| | | |-- key: string
| | | |-- value: map (valueContainsNull = true)
| | | | |-- key: string
| | | | |-- value: array (valueContainsNull = true)
| | | | | |-- element: double (containsNull = true)
>>>
>>> df.select('data.data.location.geometry.coordinates').show(truncate=False)
+----------------------+
|coordinates |
+----------------------+
|[41.84201, -89.485937]|
+----------------------+
>>> df.select('data.data.location.geometry.coordinates[1]').show()
+--------------+
|coordinates[1]|
+--------------+
| null|
+--------------+
>>>
>>> df.selectExpr('data.data.location.geometry.coordinates[1]').show()
+----------------------------------------------+
|data[data][location][geometry][coordinates][1]|
+----------------------------------------------+
| -89.485937|
+----------------------------------------------+
>>> df.selectExpr(
... 'data.data.location.geometry.coordinates[0] as coord_lat',
... 'data.data.location.geometry.coordinates[1] as coord_long'
... ).show()
+---------+----------+
|coord_lat|coord_long|
+---------+----------+
| 41.84201|-89.485937|
+---------+----------+
>>>