I was trying to load TSV files from URLs (the largest file was 1.05 GB, or 1129672402 bytes).
I used java.net.URL for it.
But it threw the error below (for the largest file):
java.lang.OutOfMemoryError: UTF16 String size is 1129672402, should be less than 1073741823
Is there any way to increase the default String size limit in Spark, or any other way to process this?
import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.stream.Collectors

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

def getGeoFeedsDataNew(spark: SparkSession, url: String, schema: StructType): DataFrame = {
  val inputStream = new URL(url).openStream()
  val reader = new BufferedReader(new InputStreamReader(inputStream))
  // Read the whole feed into one String, then split it back into rows and columns.
  // This is where the OutOfMemoryError is thrown for the 1.05 GB file.
  var data = reader.lines().collect(Collectors.joining("\n"))
    .split("\\n")
    .map("1\t".concat(_).concat("\t2"))
    .map(_.split("\t"))
  inputStream.close()
  val size = data.length
  logger.info(s"records found: ${size - 1}")
  if (size < 2) {
    return spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  }
  data = data.slice(1, size) // drop the header row
  val rowsRDD = spark.sparkContext.parallelize(data).map(row => Row.fromSeq(row.toSeq))
  val geoDataDF: DataFrame = spark.createDataFrame(rowsRDD, schema)
  geoDataDF
}
My current Spark configs:
"spark.driver.cores": "1",
"spark.driver.memory": "30G",
"spark.executor.cores": "8",
"spark.executor.memory": "30G",
"spark.network.timeout": "120",
"spark.executor.instances": "8",
"spark.rpc.message.maxSize": "1024",
"spark.driver.maxResultSize": "2G",
"spark.sql.adaptive.enabled": "true",
"spark.sql.broadcastTimeout": "10000",
"spark.sql.shuffle.partitions": "200",
"spark.shuffle.useOldFetchProtocol": "true",
"spark.hadoop.fs.s3a.committer.name": "magic",
"spark.sql.adaptive.skewJoin.enabled": "true"
Unfortunately, you seem to have hit https://bugs.openjdk.org/browse/JDK-8190429: a Java String cannot hold more than 1073741823 UTF-16 characters, and no Spark or JVM configuration can raise that limit. You would hit the same limitation if you tried to put a string that large inside a single row field. Instead of materializing the whole download as one String, save it to a file and point Spark at that file. Breaking the string up into separate fields could buy you some headroom, but you would then be fighting a 2 GB limit per row; again, no configuration can change this, since both limits come from Java byte arrays.
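As a rough illustration of the file-based approach, the sketch below streams the URL to a temp file on the driver and lets Spark's CSV reader parse it as tab-separated data. It is an assumption-laden example, not a drop-in replacement for getGeoFeedsDataNew: the function name getGeoFeedsViaFile is hypothetical, the schema is assumed to describe the raw TSV columns (without the constant "1"/"2" columns your code prepends and appends), and in cluster mode you would want to copy to shared storage such as HDFS or S3 rather than the driver's local disk.

import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Sketch: stream the feed to a local file instead of building one giant String,
// then let Spark parse the file. Nothing close to 1 GB is ever held in memory.
def getGeoFeedsViaFile(spark: SparkSession, url: String, schema: StructType): DataFrame = {
  val tmp = Files.createTempFile("geofeed-", ".tsv")
  val in = new URL(url).openStream()
  try {
    // Copies the stream to disk in chunks.
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }

  spark.read
    .option("sep", "\t")
    .option("header", "true")  // skips the header row instead of slice(1, size)
    .schema(schema)            // assumed to describe the raw TSV columns only
    .csv(tmp.toUri.toString)
}

If you still need the extra constant columns from your original code, you could add them afterwards with withColumn and lit, which keeps each field well under the per-row byte-array limit.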