Tags: scala, apache-spark, logarithm

How to take the logarithm of an RDD in Spark (Scala)


How do I take the logarithm of an RDD? I have a val rdd: RDD[Double] and I simply want to take the logarithm of it.

This is essentially the same question as this, but the solution proposed does not work. I run:

val rdd: RDD[Double] = <something>
val log_y = rdd.map(x => org.apache.commons.math3.analysis.function.Log(2.0, x))

and I get the error:

error: object org.apache.commons.math3.analysis.function.Log is not a value

Solution

  • As far as I can see, this class from commons-math3 computes the natural logarithm, and it is used like this:

    new org.apache.commons.math3.analysis.function.Log().value(3)
    res1: Double = 1.0986122886681098
    

    This is the commons-math3 version bundled with Spark 3.1.2. If you are using that same version, the map looks like this:

    val log_y = rdd.map(x => new org.apache.commons.math3.analysis.function.Log().value(x))
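    Note that Log only computes the natural logarithm; the base of 2.0 that the question tried to pass is not accepted by this class. A base-2 logarithm can be obtained with the change-of-base formula using plain scala.math, with no extra dependency. A minimal sketch, assuming the same rdd: RDD[Double]:

    ```scala
    // Change of base: log2(x) = ln(x) / ln(2)
    def log2(x: Double): Double = math.log(x) / math.log(2.0)

    // Applied to the RDD from the question:
    // val log_y = rdd.map(log2)
    ```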

    This is the source code of the class:

    public class Log implements UnivariateDifferentiableFunction, DifferentiableUnivariateFunction {
        /** {@inheritDoc} */
        public double value(double x) {
            return FastMath.log(x);
        }
    
        /** {@inheritDoc}
         * @deprecated as of 3.1, replaced by {@link #value(DerivativeStructure)}
         */
        @Deprecated
        public UnivariateFunction derivative() {
            return FunctionUtils.toDifferentiableUnivariateFunction(this).derivative();
        }
    
        /** {@inheritDoc}
         * @since 3.1
         */
        public DerivativeStructure value(final DerivativeStructure t) {
            return t.log();
        }
    
    }
    

    As you can see, you need to create an instance, and then the value method is what actually computes the logarithm. You could also create a single instance outside the map (for example as a @transient lazy val, so it is re-created on each executor rather than serialized with the closure) to avoid constructing a new instance per RDD element. Or you can use this class inside a mapPartitions call to create only one instance per partition.
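    The per-partition alternative can be sketched as follows. To keep the sketch self-contained without Spark or commons-math3 on the classpath, a stand-in Log class (whose value method mirrors commons-math3's natural-log behaviour) is defined here; with the real libraries you would import the real Log and pass the function to rdd.mapPartitions:

    ```scala
    // Stand-in for org.apache.commons.math3.analysis.function.Log:
    // its value method is just the natural logarithm.
    class Log { def value(x: Double): Double = math.log(x) }

    // The function you would pass to mapPartitions: one Log instance
    // is created per partition, then reused for every element in it.
    def logPartition(iter: Iterator[Double]): Iterator[Double] = {
      val log = new Log()   // one instance for the whole partition
      iter.map(log.value)
    }

    // With Spark this would be used as:
    // val log_y = rdd.mapPartitions(logPartition)
    ```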