Search code examples
apache-sparkpysparkdatabricksevent-logdatabricks-sql

Convert spark sql to python spark / Databricks pipeline event logs


I have the following sql statement to query the databricks pipeline event logs and it works.
I tried to rewrite it into a python code, but I failed.

Could somebody provide me any advice? Many thanks!!

SELECT timestamp, details:user_action:action, details:user_action:user_name
FROM event_log_raw 
WHERE event_type = 'user_action'

Please Note the details column here is string type, not struct nor array

The following solutions I tried didn't work
df is a spark dataFrame generated from the table event_log_raw

df.filter(df.event_type == 'user_action').select("timestamp", "details:user_action:action", "details:user_action:user_name") 
df.filter(df.event_type == 'user_action').select("timestamp", "details.user_action.action", "details.user_action.user_name") 

Solution

  • Instead of using select you need to use selectExpr because the string like details:user_action:action is a JSON path expression for extracting data from the JSON string (doc):

    df.filter(df.event_type == 'user_action')
      .selectExpr("timestamp", "details:user_action:action", "details:user_action:user_name")