I need to query Cassandra tables with Spark, using an R library called sparklyr. When I use a WHERE condition on the partition keys (my Cassandra table has two partition keys), there is no problem when I give one value per partition key. But if I give multiple values for a partition key, it takes too much time. How can I handle this problem? (There is no problem with PySpark.)
I tried the sparklyr, dplyr, and DBI libraries, but I couldn't solve it.
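For example, the dplyr route I mean looks roughly like this (only a sketch, assuming sensor_data is already registered as a Spark SQL table on the connection sc):

library(dplyr)
# Filter on both partition keys via dplyr; translated to Spark SQL by dbplyr.
tbl(sc, "sensor_data") %>%
  filter(sensor_id == 109959, filter_time %in% c("2018060813", "2018061107")) %>%
  summarise(n = n())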
My query that works fast is:
spark_session(sc) %>%
  invoke("sql", "select * from sensor_data") %>%
  invoke("where", "sensor_id = 109959 and filter_time = '2018060813'") %>%
  invoke("count")
# It takes 2 secs. (Number of Spark tasks: 2)
The problematic query is:
spark_session(sc) %>%
  invoke("sql", "select * from sensor_data") %>%
  invoke("where", "sensor_id = 109959 and filter_time in ('2018060813','2018061107')") %>%
  invoke("count")
# It takes 9 mins. (Number of Spark tasks: 987)
I think the IN predicate keeps the partition keys from being used effectively. How can I solve this? Does anyone have an idea about that?
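One possible workaround (only a sketch; the helper name count_for_times is made up) is to replace the IN predicate with one equality filter per value and union the results, since each equality filter on the partition keys can be pushed down to Cassandra on its own:

library(sparklyr)

# Hypothetical helper: run one pushed-down equality query per filter_time
# value, then union the results instead of using a single IN predicate.
count_for_times <- function(sc, sensor_id, times) {
  parts <- lapply(times, function(t) {
    spark_session(sc) %>%
      invoke("sql", "select * from sensor_data") %>%
      invoke("where", sprintf("sensor_id = %s and filter_time = '%s'", sensor_id, t))
  })
  combined <- Reduce(function(a, b) invoke(a, "union", b), parts)
  invoke(combined, "count")
}

count_for_times(sc, 109959, c("2018060813", "2018061107"))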
The problem has been solved by removing the single quotes (" ' ") around the filter_time values. The old value was '2018121205'; the new value is 2018121205. It worked for me.
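For reference, applied to the query above, the working form would look like this (assuming filter_time is a numeric column, which is what the fix implies):

spark_session(sc) %>%
  invoke("sql", "select * from sensor_data") %>%
  invoke("where", "sensor_id = 109959 and filter_time in (2018060813, 2018061107)") %>%
  invoke("count")
# With unquoted (numeric) values the predicate can be pushed down again.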