Search code examples
sparkrhivecontext

Select rows except the one that contains min value in Spark using HiveContext


I have a Spark Data Frame that contains Timestamp and Machine Ids. I wish to remove the lowest timestamp value from each group. I tried following code:

sqlC <- sparkRHive.init(sc)
ts_df2<- sql(sqlC,"SELECT ts,Machine FROM sdf2 EXCEPT SELECT MIN(ts),Machine FROM sdf2 GROUP BY Machine")

But the following error is coming:

16/04/06 06:47:52 ERROR RBackendHandler: sql on 35 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
org.apache.spark.sql.AnalysisException: missing EOF at 'SELECT' near 'EXCEPT'; line 1 pos 35

What is the problem? If HiveContext does not support EXCEPT keyword what will be synonymous way of doing the same in HiveContext?


Solution

  • The programming guide for Spark 1.6.1 shows supported and unsupported Hive features in Spark 1.6.1 http://spark.apache.org/docs/latest/sql-programming-guide.html#supported-hive-features

    I don't see EXCEPT in either category. I saw elsewhere that Hive QL doesn't support EXCEPT, or at least did not at that time. Hive QL Except clause

    Perhaps try a table of the mins and then do a left outer join as in that answer?

    SELECT ts, Machine FROM ts mins LEFT OUTER JOIN ts mins ON (ts.id=mins.id) WHERE mins.id IS NULL;
    

    You can also use the sparkR built-in function except(), though I think you would need to create you mins DataFrame first

    exceptDF <- except(df, df2)