
Is there any pagination for PySpark RDD?


We have a lot of logs and we want to extract meaningful data from them with some processing. These log files are really huge, and the result is big as well.

We have built Spark transformations that do the required jobs on the Spark cluster. After all the transformations, the resulting data is huge and cannot fit in the driver memory, so rdd.collect() is failing.

Is there any pagination-like action on an RDD we can use? Something like LIMIT in SQL: "SELECT * FROM table LIMIT 15, 10".

Or any suggestions on how to handle this case?


Solution

  • In most documents and articles, people point out that there is no support for OFFSET in Spark SQL or on RDDs as of now. Some discussion on OFFSET support in Spark can be found in an old Spark mailing-list thread here. And it does make sense: in a distributed system, offset access could be really costly. If pagination is what we are interested in, we can achieve it by filtering the RDD on indexes. An index can be attached with zipWithIndex() or zipWithUniqueId() (documentation). Similar answers are given in the discussions here and here. The SQL query and its Spark equivalent are given below.

    SQL

    SELECT * FROM person LIMIT 10, 10
    

    PySpark

    result = rdd.zipWithIndex().filter(lambda x: 10 <= x[1] < 20).map(lambda x: x[0]).collect()
    

    Hope it is useful for someone in a similar situation. A runnable PySpark sketch of the same idea follows below.
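
    Here is a minimal, self-contained PySpark sketch of the same technique; the helper name paginate() and its page/page_size parameters are just illustrative, not part of any Spark API.

    from pyspark import SparkContext

    def paginate(rdd, page, page_size):
        """Return one page of an RDD by filtering on zipWithIndex() indexes."""
        start = page * page_size
        end = start + page_size
        return (rdd.zipWithIndex()                          # (element, index) pairs
                   .filter(lambda pair: start <= pair[1] < end)
                   .map(lambda pair: pair[0])               # drop the index again
                   .collect())

    if __name__ == "__main__":
        sc = SparkContext(appName="rdd-pagination-example")
        data = sc.parallelize(range(100))
        print(paginate(data, page=1, page_size=10))         # elements 10..19
        sc.stop()

    Note that zipWithIndex() triggers a Spark job when the RDD has more than one partition, while zipWithUniqueId() does not; the trade-off is that zipWithUniqueId() does not produce consecutive indexes, so it is less convenient for page ranges.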