When I process a large file on a Spark cluster, an out-of-memory error occurs. I know I can increase the heap size, but in the general case that doesn't seem like a good approach. I'm wondering whether splitting the large file into smaller files and processing them in batches is a better choice, so that we work on small files one batch at a time instead of one large file. Roughly, I mean something like the sketch below.
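Here is a minimal sketch of what I have in mind; the file paths and the per-batch work are just placeholders to illustrate processing the splits one at a time:

```scala
import org.apache.spark.sql.SparkSession

object BatchSplits {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("batch-splits").getOrCreate()

    // Hypothetical list of pre-split input files.
    val splitPaths = Seq("/data/input/part-000", "/data/input/part-001", "/data/input/part-002")

    splitPaths.foreach { path =>
      val batch = spark.read.textFile(path)
      // Placeholder work: count the lines in this split.
      val lineCount = batch.count()
      println(s"$path -> $lineCount lines")
    }

    spark.stop()
  }
}
```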
I have run into the OOM problem as well. Since Spark computes in memory, the data, intermediate results, and so on are all kept in memory. I think cache or persist will help: you can set the storage level to MEMORY_AND_DISK_SER, so partitions that don't fit in memory are serialized and spilled to disk instead of failing.
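For example, something along these lines (the input path and transformations are placeholders; the point is the persist call with StorageLevel.MEMORY_AND_DISK_SER):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("persist-example").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical large input file.
    val lines = sc.textFile("/data/input/large-file.txt")

    // Keep the parsed data around for reuse, serialized in memory and
    // spilled to disk when memory runs short.
    val words = lines.flatMap(_.split("\\s+")).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Both actions reuse the persisted data instead of re-reading the file.
    println(s"word count: ${words.count()}")
    println(s"distinct words: ${words.distinct().count()}")

    words.unpersist()
    spark.stop()
  }
}
```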