Tags: r, apache-spark, cassandra, sparklyr

Sparklyr Query Problem With Cassandra "in" Clause


I need to query Cassandra tables with Spark, using an R library called sparklyr. When I use a where condition on the partitioning keys (my Cassandra table has 2 partitioning keys), there is no problem as long as I give a single value for each key. But if I give multiple values for a key, the query takes far too long. How can I handle this problem? (There is no such problem with pyspark.)

I tried the sparklyr, dplyr, and DBI libraries, but I couldn't solve it.
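
For context, here is a minimal sketch of the kind of setup I am using (the connector version, host, and keyspace below are placeholders, not my real values):

library(sparklyr)

# Placeholder connector version and Cassandra host; adjust to your environment.
conf <- spark_config()
conf$sparklyr.defaultPackages <- "com.datastax.spark:spark-cassandra-connector_2.11:2.3.2"
conf$spark.cassandra.connection.host <- "127.0.0.1"

sc <- spark_connect(master = "local", config = conf)

# Register the Cassandra table as the Spark SQL view "sensor_data"
# ("sensor_keyspace" is a placeholder keyspace name).
spark_read_source(
  sc,
  name = "sensor_data",
  source = "org.apache.spark.sql.cassandra",
  options = list(keyspace = "sensor_keyspace", table = "sensor_data"),
  memory = FALSE
)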

My successful query is:

spark_session(sc) %>%
  invoke("sql", "select * from sensor_data") %>%
  invoke("where", "sensor_id = 109959 and filter_time = '2018060813'") %>%
  invoke("count")

# It takes 2 secs (number of Spark tasks: 2).
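
For completeness, the dplyr form I also tried looks roughly like this (assuming the table is registered as "sensor_data"; dbplyr translates %in% into an SQL in clause, so it runs into the same slowness):

library(dplyr)

sensor_tbl <- tbl(sc, "sensor_data")

# Fast: a single value per partition key.
sensor_tbl %>%
  filter(sensor_id == 109959, filter_time == "2018060813") %>%
  count()

# Slow: multiple values, translated to an SQL "in" clause.
sensor_tbl %>%
  filter(sensor_id == 109959, filter_time %in% c("2018060813", "2018061107")) %>%
  count()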

The problematic query is:

spark_session(sc) %>%
  invoke("sql", "select * from sensor_data") %>%
  invoke("where", "sensor_id = 109959 and filter_time in ('2018060813','2018061107')") %>%
  invoke("count")

# It takes 9 mins (number of Spark tasks: 987).

I think the partition keys are not being used effectively inside the "in" clause (the task count suggests a full scan). How can I solve this? Does anyone have an idea?


Solution

  • The problem was solved by removing the single quotes (" ' ") around the values.

    The old value was '2018121205'; the new value is 2018121205.

    It worked for me.
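
    For reference, the query from the question with the quotes removed looks like this (assuming filter_time is a numeric column, which is how I read the fix):

    spark_session(sc) %>%
      invoke("sql", "select * from sensor_data") %>%
      invoke("where", "sensor_id = 109959 and filter_time in (2018060813, 2018061107)") %>%
      invoke("count")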