Search code examples
apache-sparklazy-evaluationdistributed-computingrddapache-spark-sql

How to force Spark to evaluate DataFrame operations inline


According to the Spark RDD docs:

All transformations in Spark are lazy, in that they do not compute their results right away...This design enables Spark to run more efficiently.

There are times when I need to do certain operations on my dataframes right then and now. But because dataframe ops are "lazily evaluated" (per above), when I write these operations in the code, there's very little guarantee that Spark will actually execute those operations inline with the rest of the code. For example:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

// Now we need to do a union RIGHT HERE AND NOW, because
// the next few lines of code require the union to have
// already taken place!
val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)

// Now do some stuff with 'unionDataFrame'...

So my workaround for this (so far) has been to run .show() or .count() immediately following my time-sensitive dataframe op, like so:

val someDataFrame : DataFrame = getSomehow()
val someOtherDataFrame : DataFrame = getSomehowAlso()
// Do some stuff with 'someDataFrame' and 'someOtherDataFrame'

val unionDataFrame : DataFrame = someDataFrame.unionAll(someOtherDataFrame)
unionDataFrame.count()  // Forces the union to execute/compute

// Now do some stuff with 'unionDataFrame'...

...which forces Spark to execute the dataframe op right then in there, inline.

This feels awfully hacky/kludgy to me. So I ask: is there a more generally-accepted and/or efficient way to force dataframe ops to happen on-demand (and not be lazily evaluated)?


Solution

  • No.

    You have to call an action to force Spark to do actual work. Transformations won't trigger that effect, and that's one of the reasons to love .


    By the way, I am pretty sure that knows very well when something must be done "right here and now", so probably you are focusing on the wrong point.


    Can you just confirm that count() and show() are considered "actions"

    You can see some of the action functions of Spark in the documentation, where count() is listed. show() is not, and I haven't used it before, but it feels like it is an action-how can you show the result without doing actual work? :)

    Are you insinuating that Spark would automatically pick up on that, and do the union (just in time)?

    Yes! :)

    remembers the transformations you have called, and when an action appears, it will do them, just in -the right- time!


    Something to remember: Because of this policy, of doing actual work only when an action appears, you will not see a logical error you have in your transformation(s), until the action takes place!