I am reading a large file from S3 in a Glue job. It is a .txt file that I convert to .csv, and I then read all the values in a particular column. I want to leverage Glue's parallelism here, so that the reading can be picked up as tasks by the Glue workers.

Do I need to programmatically split the file and submit the small chunks to the workers myself, or does Spark take care of the parallelism and is smart enough to split the file and distribute it to the workers on its own?
AWS Glue splits the file and distributes the pieces to the worker nodes on its own, because it runs on Spark, so no manual splitting of the file is required. However, you can tune Spark's file-reading parallelism through these properties (see the sketch after the list):
spark.sql.files.maxPartitionBytes: The maximum number of bytes to pack into a single partition when reading files. Default is 128 MB.
spark.sql.files.minPartitionNum: The suggested (not guaranteed) minimum number of partitions when reading files. It defaults to spark.default.parallelism, which is the larger of 2 and the total number of cores in the cluster.
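Here is a minimal PySpark sketch of how this might look inside a Glue job. The S3 path s3://my-bucket/data/input.txt, the tab delimiter, and the 64 MB target are placeholders, and spark.sql.files.minPartitionNum requires Spark 3.1+ (Glue 3.0 or later):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Target ~64 MB per input partition instead of the 128 MB default (placeholder value).
spark.conf.set("spark.sql.files.maxPartitionBytes", str(64 * 1024 * 1024))
# Suggest at least one partition per available core (a hint, not a guarantee).
spark.conf.set("spark.sql.files.minPartitionNum", str(sc.defaultParallelism))

# Spark splits the S3 object into partitions on its own; no manual chunking is needed.
df = (spark.read
      .option("header", "false")
      .option("delimiter", "\t")   # assumed delimiter of the .txt file
      .csv("s3://my-bucket/data/input.txt"))

# Each partition becomes one read task on the workers.
print("partitions:", df.rdd.getNumPartitions())

# Work on a single column in parallel, e.g. the distinct values of the first column.
df.select(df.columns[0]).distinct().show(20)
```

Printing df.rdd.getNumPartitions() is a quick way to confirm that Spark has already split the file; raising or lowering maxPartitionBytes should change that number accordingly.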
You can read more on these pages: Salesforce Engineering Blog and the Apache Spark documentation.