I was trying to load TSV files from URLs (the largest file was 1.05 GB, or 1129672402 bytes).
I used java.net.URL for it.
But it threw the error below (for the largest file):
java.lang.OutOfMemoryError: UTF16 String size is 1129672402, should be less than 1073741823
Is there any way to increase the default String size limit in Spark, or any other way to process this?
import java.io.{BufferedReader, InputStreamReader}
import java.net.URL
import java.util.stream.Collectors

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

def getGeoFeedsDataNew(spark: SparkSession, url: String, schema: StructType): DataFrame = {
  val inputStream = new URL(url).openStream()
  val reader = new BufferedReader(new InputStreamReader(inputStream))
  // Read the whole feed into one String, then split it back into rows and columns.
  // This is where the OutOfMemoryError is thrown for the 1.05 GB file.
  var data = reader.lines().collect(Collectors.joining("\n"))
    .split("\\n")
    .map("1\t".concat(_).concat("\t2"))
    .map(_.split("\t"))
  inputStream.close()
  val size = data.length
  logger.info(s"records found: ${size - 1}")
  if (size < 2) {
    return spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
  }
  data = data.slice(1, size) // drop the header row
  val rowsRDD = spark.sparkContext.parallelize(data).map(row => Row.fromSeq(row.toSeq))
  val geoDataDF: DataFrame = spark.createDataFrame(rowsRDD, schema)
  geoDataDF
}
My current Spark configs:
"spark.driver.cores": "1",
"spark.driver.memory": "30G",
"spark.executor.cores": "8",
"spark.executor.memory": "30G",
"spark.network.timeout": "120",
"spark.executor.instances": "8",
"spark.rpc.message.maxSize": "1024",
"spark.driver.maxResultSize": "2G",
"spark.sql.adaptive.enabled": "true",
"spark.sql.broadcastTimeout": "10000",
"spark.sql.shuffle.partitions": "200",
"spark.shuffle.useOldFetchProtocol": "true",
"spark.hadoop.fs.s3a.committer.name": "magic",
"spark.sql.adaptive.skewJoin.enabled": "true"
Unfortunately, you seem to have hit https://bugs.openjdk.org/browse/JDK-8190429: a Java String cannot hold more than 1073741823 UTF-16 characters, and no Spark or JVM configuration can raise that limit. You would hit the same limitation if you tried to put a string that large inside a single row field. Instead of materializing the whole download as one String, save it to a file and point Spark at that file. Breaking the string up into separate fields could buy you some headroom, but you would then be fighting a 2 GB limit per row; again, no configuration can change this, since both limits come from Java byte arrays.
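As a rough illustration of the file-based approach, the sketch below streams the URL to a temp file on the driver and lets Spark's CSV reader parse it as tab-separated data. It is an assumption-laden example, not a drop-in replacement for getGeoFeedsDataNew: the function name getGeoFeedsViaFile is hypothetical, the schema is assumed to describe the raw TSV columns (without the constant "1"/"2" columns your code prepends and appends), and in cluster mode you would want to copy to shared storage such as HDFS or S3 rather than the driver's local disk.

import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

// Sketch: stream the feed to a local file instead of building one giant String,
// then let Spark parse the file. Nothing close to 1 GB is ever held in memory.
def getGeoFeedsViaFile(spark: SparkSession, url: String, schema: StructType): DataFrame = {
  val tmp = Files.createTempFile("geofeed-", ".tsv")
  val in = new URL(url).openStream()
  try {
    // Copies the stream to disk in chunks.
    Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING)
  } finally {
    in.close()
  }

  spark.read
    .option("sep", "\t")
    .option("header", "true")  // skips the header row instead of slice(1, size)
    .schema(schema)            // assumed to describe the raw TSV columns only
    .csv(tmp.toUri.toString)
}

If you still need the extra constant columns from your original code, you could add them afterwards with withColumn and lit, which keeps each field well under the per-row byte-array limit.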