Search code examples
pythonapache-sparkpysparkdatabricks

Spark getItem shortcut


I am doing the following in spark sql:

spark.sql("""
            SELECT
            data.data.location.geometry.coordinates[0]
            FROM df""")

This works fine, however I do not want to use raw SQL, I use dataframe API like so:

df.select("data.data.location.geometry.coordinates[0]")

Unfortunately this does not work:

AnalysisException: [DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "data.data.location.geometry.coordinates[0]" due to data type mismatch: Parameter 2 requires the "INTEGRAL" type, however "0" has the type "STRING".;
'Project [data#680.data.location.geometry.coordinates[0] AS 0#697]
+- Relation [data#680,id#681,idempotencykey#682,source#683,specversion#684,type#685] json

I know that I can use the F.col api and go with a getItem(0), but is there a built-in way to have the shortcut of getItem?

'.' is the shortcut of getField is there one for array slicing?

Thank you for your insight


Solution

  • TL;DR; use selectExpr()

    df.selectExpr('data.data.location.geometry.coordinates[0]')


    Looks like indexing inside an array using [] is considered an expression, where as data.data.location... is considered a column name.

    DataFrame.select() takes column name as arguments, so doesn't understand [0], what you want is DataFrame.selectExpr().

    >>>
    >>> df =  spark.createDataFrame([({
    ...     "data": {
    ...         "location": {
    ...             "geometry": {
    ...                 "coordinates": [41.84201, -89.485937]
    ...             }
    ...         }
    ...     }
    ... },)]).withColumnRenamed('_1', 'data')
    >>> df.show(truncate=False)
    +-----------------------------------------------------------------------------+
    |data                                                                         |
    +-----------------------------------------------------------------------------+
    |{data -> {location -> {geometry -> {coordinates -> [41.84201, -89.485937]}}}}|
    +-----------------------------------------------------------------------------+
    
    >>> df.printSchema()
    root
     |-- data: map (nullable = true)
     |    |-- key: string
     |    |    |-- key: string
     |    |    |-- value: map (valueContainsNull = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: map (valueContainsNull = true)
     |    |    |    |    |-- key: string
     |    |    |    |    |-- value: array (valueContainsNull = true)
     |    |    |    |    |    |-- element: double (containsNull = true)
    
    >>>
    
    
    >>> df.select('data.data.location.geometry.coordinates').show(truncate=False)
    +----------------------+
    |coordinates           |
    +----------------------+
    |[41.84201, -89.485937]|
    +----------------------+
    
    >>> df.select('data.data.location.geometry.coordinates[1]').show()
    +--------------+
    |coordinates[1]|
    +--------------+
    |          null|
    +--------------+
    >>>
    
    >>> df.selectExpr('data.data.location.geometry.coordinates[1]').show()
    +----------------------------------------------+
    |data[data][location][geometry][coordinates][1]|
    +----------------------------------------------+
    |                                    -89.485937|
    +----------------------------------------------+
    
    >>> df.selectExpr(
    ...          'data.data.location.geometry.coordinates[0] as coord_lat',
    ...          'data.data.location.geometry.coordinates[1] as coord_long'
    ... ).show()
    +---------+----------+
    |coord_lat|coord_long|
    +---------+----------+
    | 41.84201|-89.485937|
    +---------+----------+
    
    >>>