
How to work on a small portion of a big data file in Spark?


I have a big data file loaded in Spark, but I want to run my analysis on only a small portion of it. Is there a way to do that? I tried repartition, but it causes a lot of shuffling. Is there a good way to process just a small chunk of a big file loaded in Spark?


Solution

  • In short

    You can use the sample() or randomSplit() transformations on an RDD. Unlike repartition(), both sample each partition in place, so they don't trigger a full shuffle.

    sample()

    /**
     * Return a sampled subset of this RDD.
     *
     * @param withReplacement can elements be sampled multiple times
     * @param fraction expected size of the sample as a fraction of this RDD's size
     *  without replacement: probability that each element is chosen; fraction must be [0, 1]
     *  with replacement: expected number of times each element is chosen; fraction must be
     *  greater than or equal to 0
     * @param seed seed for the random number generator
     *
     * @note This is NOT guaranteed to provide exactly the fraction of the count
     * of the given [[RDD]].
     */
    def sample(
        withReplacement: Boolean,
        fraction: Double,
        seed: Long = Utils.random.nextLong): RDD[T]

    Example:

    val sampleWithoutReplacement = rdd.sample(false, 0.2, 2)  // ~20% of rdd, reproducible via seed 2
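
    To make that concrete, here is a minimal, self-contained sketch. The object name, the local master, and the parallelized range standing in for the "big" file are illustrative assumptions, not part of the original answer:

    import org.apache.spark.{SparkConf, SparkContext}

    object SampleDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("SampleDemo").setMaster("local[*]"))

        // Stand-in for the big input; in practice this might come from sc.textFile(...).
        val bigRdd = sc.parallelize(1 to 1000000)

        // Keep roughly 1% of the elements, without replacement, with a fixed seed.
        val sampled = bigRdd.sample(withReplacement = false, fraction = 0.01, seed = 42L)

        // The sampled count is only approximately 10000 (see the @note above).
        println(s"full: ${bigRdd.count()}, sampled: ${sampled.count()}")

        sc.stop()
      }
    }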
    

    randomSplit()

    /**
     * Randomly splits this RDD with the provided weights.
     *
     * @param weights weights for splits, will be normalized if they don't sum to 1
     * @param seed random seed
     *
     * @return split RDDs in an array
     */
    def randomSplit(
        weights: Array[Double],
        seed: Long = Utils.random.nextLong): Array[RDD[T]]
    

    Example:

    val rddParts = rdd.randomSplit(Array(0.8, 0.2))  // splits rdd in an 80/20 ratio
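
    Continuing with the hypothetical bigRdd from the sketch above, a typical workflow is to prototype the analysis on the small split and only then run it on the rest; the names and the 80/20 weights here are illustrative:

    // Destructure the returned array into the two splits.
    val Array(rest, small) = bigRdd.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Develop and debug the analysis on the ~20% split first...
    val prototypeResult = small.map(_ * 2).count()

    // ...then apply the same logic to the remaining ~80% (or to the full RDD).
    val fullResult = rest.map(_ * 2).count()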