Tags: apache-spark, pyspark

How to batch up items from a PySpark DataFrame


I have a PySpark DataFrame, and for each batch of records I want to call an API. Say I have 100000k records; I want to batch the items into groups of, say, 1000 and call the API once per batch. How can I do this with PySpark? The reason for batching is that the API probably will not accept a huge chunk of data from a Big Data system all at once.

I first thought of LIMIT, but that won't be "deterministic". Furthermore, it seems like it would be inefficient.


Solution

  • Use foreachPartition and group each partition's iterator into chunks:

        df.foreachPartition { partition =>
          partition.grouped(1000).foreach { chunk =>
            postToServer(chunk)
          }
        }

    The code is in Scala; you can do the same in Python. It will create batches of 1000 records within each partition.
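    Since the question is about PySpark, here is a minimal Python sketch of the same idea. It assumes a hypothetical post_to_server helper that sends one batch of rows to the API; the batch size of 1000 matches the Scala example above:

        from itertools import islice

        def post_to_server(batch):
            # Hypothetical helper (not part of Spark): send one list of Row
            # objects to the external API.
            pass

        def process_partition(rows):
            # rows is an iterator over the Row objects in a single partition.
            iterator = iter(rows)
            while True:
                batch = list(islice(iterator, 1000))  # take up to 1000 rows
                if not batch:
                    break
                post_to_server(batch)

        df.foreachPartition(process_partition)

    Note that batching happens per partition, so the last chunk of each partition may contain fewer than 1000 records.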