Tags: java, apache-spark, hadoop, io, apache-spark-sql

File with single Line around 4G to load into Spark


I am trying to load a file that is a single line; there are no newline characters in the entire file, so technically the size of that single line is the size of the file. I tried the code below to load the data.

val data = spark.sparkContext.textFile("location")
data.count

It never returns any value.

I then tried to read the file as a string with the following code, using the Hadoop Java API:

import java.io.{BufferedReader, InputStreamReader}

import org.apache.hadoop.fs.{FileSystem, Path}

val inputPath = new Path("File")
val conf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(conf)
val inputStream = fs.open(inputPath)

// readLine() reads until the next \n, so it tries to buffer the entire 4G line in memory
val readLines = new BufferedReader(new InputStreamReader(inputStream)).readLine()

The JVM exits with the following error:

Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x00007fcb6ba00000, 2148532224, 0) failed; error='Cannot allocate memory' (errno=12)

There is insufficient memory for the Java Runtime Environment to continue. Native memory allocation (mmap) failed to map 2148532224 bytes for committing reserved memory.

The problem is that the entire data is on a single line, and Spark uses \n to identify a new record (new line). Since there is no \n, Spark tries to load everything as a single record, which creates the memory issues.
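From what I understand, the record delimiter itself can be changed on the underlying Hadoop TextInputFormat, but that only helps when the data contains some other recurring separator, which my file does not. A rough sketch of that approach, assuming a hypothetical ';' separator:

// Sketch only: use a custom record delimiter (here a hypothetical ';') instead of \n.
// Not applicable to my file, since it has no recurring separator at all.
spark.sparkContext.hadoopConfiguration
  .set("textinputformat.record.delimiter", ";")

val records = spark.sparkContext.textFile("location")
records.count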

I am OK with splitting that long string based on length, adding a newline character after every 200 characters: (0,200) is the first line, (200,400) is the second line, and so on.

Sample Input

This is Achyuth This is ychyath This is Mansoor ... .... this line size is more than 4 gigs.

Output

This is Achyuth
This is ychyath
This is Mansoor
. 
. 
.

Solution

  • This approach works if the file size is a multiple of the split size and the character encoding is fixed-length (ASCII, UTF-16, UTF-32, no code points above 127 in UTF-8 or similar...).

    Given file

    This is AchyuthThis is ychyathThis is Mansoor
    
    import spark.implicits._  // String encoder needed by createDataset

    // Read fixed-length 15-byte records ("This is Achyuth" is 15 characters),
    // so no newline delimiter is needed
    val rdd = spark
      .sparkContext
      .binaryRecords(path, 15)
      .map(bytes => new String(bytes))
    val df = spark.createDataset(rdd)
    df.show()
    

    Output:

    +---------------+
    |          value|
    +---------------+
    |This is Achyuth|
    |This is ychyath|
    |This is Mansoor|
    +---------------+
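    For the actual file in the question, the same call should work with the 200-byte record length you mentioned. This is only a sketch: it assumes a single-byte encoding and a file size that is an exact multiple of 200 bytes, per the caveat above.

    import spark.implicits._

    // Sketch: 200-byte fixed-length records for the 4G single-line file.
    // Assumes single-byte characters and a file size that is a multiple of 200.
    val lines = spark
      .sparkContext
      .binaryRecords("location", 200)
      .map(bytes => new String(bytes))
    spark.createDataset(lines).show(false)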