I have a spark.sql query that is wrapped inside a function. I want to pass the query a function argument that is a dataframe but getting some error. Can anyone see if I am doing something wrong?
Function:
1 def my_function(df_table: DataFrame) -> DataFrame:
2
3 sql_query = f"""
4 SELECT DISTINCT dt.CountryId,
5 Cast(RIGHT(dt.RegionIdentifier, 2) as Integer) as RegionID
6 FROM {df_table} dt
7 WHERE dt.CountryCode = 23
8 """
9
10 df = spark.sql(sql_query)
11 return df
and this is how I call it in a notebook:
df_table = spark.table('path_to_table/_location/')
my_function(df_table)
Error message I get:
[PARSE_SYNTAX_ERROR] Syntax error at or near '['. SQLSTATE: 42601
If i remove the {df_table}
on LINE-6, and put the name of table hardcoded, it works. Is there a way to pass in the table name as Dataframe and have it passed as an arg?
and when I print the sql_query, it shows:
SELECT DISTINCT dt.CountryId,
Cast(RIGHT(dt.RegionIdentifier, 2) as Integer) as RegionID
FROM DataFrame[CountryId: bigint, RegionID: int, RegionIdentifier: string, TimeOf: timestamp] dt
WHERE dt.CountryCode = 23
To be able to access the DataFrame within a SQL query, register it as a temporary view using method DataFrame.createOrReplaceTempView:
def my_function(df_table: DataFrame) -> DataFrame:
temp_view_name = "some_temp_view"
df_table.createOrReplaceTempView(temp_view_name)
sql_query = f"""
SELECT DISTINCT dt.CountryId,
Cast(RIGHT(dt.RegionIdentifier, 2) as Integer) as RegionID
FROM {temp_view_name} dt
WHERE dt.CountryCode = 23
"""
df = spark.sql(sql_query)
return df