Tags: apache-spark, pyspark, apache-spark-sql, hivecontext

How to apply window functions in HiveQL in Spark


I have seen posts discussing the usage of window functions, but I have some questions.

  1. Since window functions can only be used with a HiveContext, how can I switch between SparkSQLContext and HiveContext given that I am already using SparkSQLContext?
  2. How is it possible to run HiveQL with a window function here? I tried

    df.registerTempTable("data")
    from pyspark.sql import functions as F
    from pyspark.sql import Window
    

    %%hive
    SELECT col1, col2, F.rank() OVER (Window.partitionBy("col1").orderBy("col3") 
    FROM data
    

and native Hive SQL

SELECT col1, col2, RANK() OVER (PARTITION BY col1 ORDER BY col3) FROM data

but neither of them works.


Solution

  • How can I switch between SparkSQLContext and HiveContext given I am already using SparkSQLContext?

    You cannot. Spark data frames and tables are bound to a specific context. If you want to use HiveContext then use it all the way; you pull in all its dependencies anyway.

    How is it possible to run HiveQL using window functions here?

    sqlContext = ...  # HiveContext 
    sqlContext.sql(query)
    

    The first query you use is simply invalid: you cannot embed PySpark DataFrame API calls like `F.rank()` or `Window.partitionBy("col1")` inside a raw SQL string. The second one should work if you use the correct context and configuration.
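    Putting it together, here is a self-contained sketch showing both working forms against a small made-up dataset (the column names and sample rows are assumptions, chosen to mirror the question). The `createOrReplaceTempView` call is the modern name for `registerTempTable`; on Spark 1.x use the latter:

    ```python
    from pyspark import SparkContext
    from pyspark.sql import HiveContext
    from pyspark.sql import functions as F
    from pyspark.sql import Window

    sc = SparkContext()
    sqlContext = HiveContext(sc)

    # Hypothetical sample data shaped like the question's table
    df = sqlContext.createDataFrame(
        [("a", 1, 3), ("a", 2, 1), ("b", 3, 2)],
        ["col1", "col2", "col3"])
    df.createOrReplaceTempView("data")

    # Form 1: the DataFrame API, with the window spec passed to .over()
    w = Window.partitionBy("col1").orderBy("col3")
    ranked = df.select("col1", "col2", F.rank().over(w).alias("rnk"))

    # Form 2: the equivalent raw HiveQL, run through the HiveContext
    ranked_sql = sqlContext.sql(
        "SELECT col1, col2, RANK() OVER (PARTITION BY col1 ORDER BY col3) AS rnk "
        "FROM data")
    ```

    Both return the same rows: within each `col1` partition, rows are ranked by ascending `col3`.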