When I process a large file on a Spark cluster, an out-of-memory error occurs. I know I can increase the heap size, but in the general case that doesn't seem like a good approach. I'm wondering whether splitting the large file into smaller files and processing them in batches is a better choice, so that we work on small files one batch at a time instead of one large file. Roughly, I mean something like the sketch below.
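Here is a minimal sketch of what I have in mind; the file paths and the per-batch work are just placeholders to illustrate processing the splits one at a time:

```scala
import org.apache.spark.sql.SparkSession

object BatchSplits {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-splits").getOrCreate()

    // Hypothetical list of pre-split input files.
    val splitPaths = Seq("/data/input/part-000", "/data/input/part-001", "/data/input/part-002")

    splitPaths.foreach { path =>
      val batch = spark.read.textFile(path)
      // Placeholder work: count the lines in this split.
      val lineCount = batch.count()
      println(s"$path -> $lineCount lines")
    }

    spark.stop()
  }
}
```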
I have run into the OOM problem as well. Since Spark computes in memory, the data, intermediate results, and so on are all kept in memory. I think cache or persist will help: you can set the storage level to MEMORY_AND_DISK_SER, so partitions that don't fit in memory are serialized and spilled to disk instead of failing.
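For example, something along these lines (the input path and transformations are placeholders; the point is the persist call with StorageLevel.MEMORY_AND_DISK_SER):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-example").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical large input file.
    val lines = sc.textFile("/data/input/large-file.txt")

    // Keep the parsed data around for reuse, serialized in memory and
    // spilled to disk when memory runs short.
    val words = lines.flatMap(_.split("\\s+")).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Both actions reuse the persisted data instead of re-reading the file.
    println(s"word count: ${words.count()}")
    println(s"distinct words: ${words.distinct().count()}")

    words.unpersist()
    spark.stop()
  }
}
```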