I have 30 GB of ORC files (24 parts × ~1.3 GB each) in S3, which I am reading with Spark to perform some operations. From the logs I observed that even before any operation runs, Spark opens and reads all 24 parts from S3 (taking 12 minutes just to read the files). My concern is that all of this reading happens on the driver while the executors sit idle.
Can someone explain why this is happening? Is there any way I can utilize all the executors for this reading as well?
Does the same apply to Parquet as well?
Thanks in advance.
Have you provided the schema of your data?
If not, Spark first infers the schema by reading the footers of all the files on the driver, and only then proceeds with the execution. Supplying the schema up front skips this step.