In Spark, how to quickly estimate the number of elements in a DataFrame

In Spark, is there a fast way to get an approximate count of the number of elements in a Dataset? That is, something faster than Dataset.count().

Maybe we could calculate this from the number of partitions of the Dataset?


  • You could try countApprox from the RDD API. Although it also launches a Spark job, it should be faster, because it only gives you an estimate of the true count, based on how long you are willing to wait (the timeout, in milliseconds) and a confidence level (i.e. the probability that the true value lies within the returned range):

    Example usage:

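    // Wait at most 1000 ms; the result is a PartialResult[BoundedDouble]
    val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.90)
    // Lower and upper bounds of the estimate available when the timeout expired
    val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)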

    You will have to experiment a bit with the timeout and confidence parameters. The higher the timeout, the more accurate the estimated count, because more partitions have been counted by the time the result is returned.
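
To get a feel for how the timeout affects the estimate, a small experiment along these lines may help. This is only a sketch under assumptions: the local SparkSession, the generated 10-million-row DataFrame, the object name CountApproxSketch and the helper approxCount are all made up for illustration, and the timeouts are arbitrary. It calls countApprox with increasing timeouts and prints the bounds each call reports.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CountApproxSketch {

  // Hypothetical helper (not part of Spark): returns (low, mean, high)
  // of the estimate that is available once the timeout has elapsed.
  def approxCount(df: DataFrame, timeoutMs: Long, confidence: Double): (Double, Double, Double) = {
    val estimate = df.rdd.countApprox(timeoutMs, confidence).initialValue
    (estimate.low, estimate.mean, estimate.high)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would reuse its own session.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-approx-sketch")
      .getOrCreate()

    // Made-up data set: 10 million rows spread over 200 partitions.
    val df = spark.range(0L, 10000000L, 1L, 200).toDF("id")

    // Arbitrary timeouts, shortest first, to compare how wide the interval is.
    for (timeoutMs <- Seq(50L, 500L, 5000L)) {
      val (low, mean, high) = approxCount(df, timeoutMs, confidence = 0.95)
      println(f"timeout=$timeoutMs%5d ms  mean=$mean%.0f  range=[$low%.0f, $high%.0f]")
    }

    spark.stop()
  }
}
```

If the interval reported for a given timeout is too wide for your purposes, raising the timeout is usually the first thing to try; the confidence value changes how wide the reported range is, not how much of the data has been counted.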