Tags: python, dataframe, apache-spark, pyspark

Where can I find an exhaustive list of actions for Spark?


I want to know exactly what I can do in Spark without triggering the computation of an RDD/DataFrame.

It's my understanding that only actions trigger the execution of the transformations that produce a DataFrame. The problem is that I can't find a comprehensive list of Spark actions.

The Spark documentation lists some actions, but the list is not exhaustive. For example, show is not there, even though it is considered an action.

  • Where can I find a full list of actions?
  • Can I assume that all methods listed here are also actions?

Solution

  • All methods annotated in the source with @group action are actions. They are listed together in the Scaladoc, and they can also be identified in the source, where each such method's definition looks like this:

       * @group action
       * @since 1.6.0
       */
      def show(numRows: Int): Unit = show(numRows, truncate = true)
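
One way to collect the annotated methods without reading the file by hand is to scan the source text for doc comments that carry @group action and take the name of the def that follows. The sketch below is only an illustration: SAMPLE_SOURCE is a hypothetical, simplified excerpt written in the style of the Spark source (not the real file), and methods_in_group is a helper name made up for this example.

```python
import re
from typing import List

# Hypothetical, simplified excerpt in the style of the Spark source.
# The method bodies are placeholders, not the real implementations.
SAMPLE_SOURCE = '''
  /**
   * Displays the top rows.
   *
   * @group action
   * @since 1.6.0
   */
  def show(numRows: Int): Unit = show(numRows, truncate = true)

  /**
   * Selects a set of columns.
   *
   * @group untypedrel
   * @since 2.0.0
   */
  def select(cols: Column*): DataFrame = ???

  /**
   * Returns the number of rows.
   *
   * @group action
   * @since 1.6.0
   */
  def count(): Long = ???
'''

# One Scaladoc comment (never running past its own closing "*/"),
# optional modifiers such as private[sql], then "def <name>".
DOC_THEN_DEF = re.compile(
    r"/\*\*(?P<doc>(?:(?!\*/).)*)\*/"
    r"\s*(?:(?:private|protected|override|final)(?:\[\w+\])?\s+)*"
    r"def\s+(?P<name>\w+)",
    re.DOTALL,
)


def methods_in_group(source: str, group: str) -> List[str]:
    """Return names of methods whose doc comment contains '@group <group>'."""
    tag = re.compile(rf"@group\s+{re.escape(group)}\b")
    names: List[str] = []
    for match in DOC_THEN_DEF.finditer(source):
        name = match.group("name")
        if tag.search(match.group("doc")) and name not in names:
            names.append(name)
    return names


if __name__ == "__main__":
    print(methods_in_group(SAMPLE_SOURCE, "action"))
```

To try this on the real file, read it into a string and pass it in place of SAMPLE_SOURCE. The regular expression is a heuristic, so unusual formatting may be missed.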
    

    Additionally, some methods lack that annotation but still trigger eager evaluation: those that call withAction. checkpoint, for example, performs an action but isn't grouped as such in the docs:

    private[sql] def checkpoint(eager: Boolean, reliableCheckpoint: Boolean): Dataset[T] = {
      val actionName = if (reliableCheckpoint) "checkpoint" else "localCheckpoint"
      withAction(actionName, queryExecution) { physicalPlan =>
        val internalRdd = physicalPlan.execute().map(_.copy())
        if (reliableCheckpoint) {
    

    To find all of them:

    1. Go to the source.
    2. Press Ctrl+F.
    3. Search for private def withAction.
    4. Click on withAction.
    5. On the right you should see a list of the methods that use it. This is how that list currently looks:

    [Screenshot: current withAction methods]