I'm working on a benchmarking task, and I need to generate millions of rows of event json.
Here is my sample code:
import scala.collection.mutable
import org.apache.spark.rdd.RDD

def generateEntry(): String = {
  s"""
     |{
     |  "memberId": "${java.util.UUID.randomUUID.toString}",
     |  "first_name": "${nameRandomizer}",
     |  "last_name": "${nameRandomizer}"
     |}""".stripMargin
}
// Generate 1000000 rows of Json String with fields: memberId, first_name, last_name
val entryList = mutable.ListBuffer[String]()
for (_ <- 1 to 1000000) {
  entryList += generateEntry()
}
val inputRDD: RDD[String] = sc.parallelize(entryList.result())
However, this is causing an error:
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.StringContext.standardInterpolator(StringContext.scala:126)
at scala.StringContext.s(StringContext.scala:95)
I am coding in Spark, by the way. I tried doing this in batches, but the error still seems to occur. Please let me know what I am doing wrong, or provide sample code that I can use as a guide to fix this. Thanks!
A ListBuffer is not needed, and it is the source of the OutOfMemoryError: the loop materializes all 1,000,000 JSON strings in driver memory before Spark is ever involved. Instead, map a Spark range to your function so the rows are generated on the executors:
val inputRDD: RDD[String] = spark.range(1000000).rdd.map(x => generateEntry())
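For reference, here is a minimal end-to-end sketch of that approach, assuming a spark-shell or notebook session where spark is already defined; the body of nameRandomizer is a hypothetical stand-in, since the original helper is not shown, and the output path is only an example:

import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the nameRandomizer helper from the question.
def nameRandomizer: String = {
  val names = Array("Alice", "Bob", "Carol", "Dave", "Eve")
  names(scala.util.Random.nextInt(names.length))
}

def generateEntry(): String =
  s"""
     |{
     |  "memberId": "${java.util.UUID.randomUUID.toString}",
     |  "first_name": "${nameRandomizer}",
     |  "last_name": "${nameRandomizer}"
     |}""".stripMargin

// Each element of the range is turned into a JSON string on an executor,
// so the driver never holds all 1,000,000 strings at once.
val inputRDD: RDD[String] = spark.range(1000000).rdd.map(_ => generateEntry())

// Any action (here, writing the strings out as text files) triggers the
// distributed generation.
inputRDD.saveAsTextFile("/tmp/generated-events")

Because generation is deferred to the map function, memory use on each executor is bounded by the partition size rather than the total row count; if you later need the data as a DataFrame, spark.read.json can parse the generated strings.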