I am running a data retrieval query against an RDS Aurora MySQL instance (size r5.2xlarge) that uses the InnoDB engine internally. The table holds over 600 million records. The query retrieves a few columns from the table and runs via PySpark on EMR. I'm adding partition info to the JDBC read call to parallelize the reads. The partition column is an auto-increment numeric column that has been indexed so records are fetched faster. As a result, WHERE clauses are generated dynamically by the Spark engine based on the values of the partition column. Here is the query:
SELECT <subset-of-columns>
FROM <db>.<table>
WHERE <index-column> BETWEEN <lower-bound> AND <max-value-of-the-partition>;
I'm specifying the number of partitions as 60. Hence, 60 such queries, each with different bounds in its WHERE clause, are executed. This ran fine a few months ago but not now. CPU utilization is going over 90%, and Spark is only able to run 50 tasks (equal to 50 partitions) at most. I'm not able to figure out what's causing this issue now.
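For context, this is roughly how Spark splits a partitioned JDBC read into per-partition WHERE clauses: it divides the range between the lower and upper bounds into `numPartitions` strides, one query per stride. The sketch below is a simplified, pure-Python approximation of that splitting (the real implementation lives inside Spark's JDBC data source, and the column name and bounds here are illustrative, not from my actual setup):

```python
def partition_predicates(column, lower, upper, num_partitions):
    """Approximate Spark's stride-based split of [lower, upper] into
    one WHERE predicate per JDBC partition."""
    stride = (upper - lower) // num_partitions
    predicates = []
    current = lower
    for i in range(num_partitions):
        if i == num_partitions - 1:
            # The last partition is open-ended so rows above the
            # computed bound are not silently dropped.
            predicates.append(f"{column} >= {current}")
        elif i == 0:
            predicates.append(f"{column} < {current + stride}")
            current += stride
        else:
            predicates.append(
                f"{column} >= {current} AND {column} < {current + stride}"
            )
            current += stride
    return predicates

# 60 partitions over an ID range of 1..600 million -> 60 queries
preds = partition_predicates("id", 1, 600_000_000, 60)
print(len(preds))
print(preds[0])
```

Each predicate becomes one task/query against the database, which is why the partition count directly controls how many concurrent connections hit the server.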
I implemented a custom solution for this. My bad, I forgot to post it here; I'm doing so now in the hope that it helps others who run into a similar problem in the future.
Solution: I came up with a fixed configuration, specified below, to restrict the number of connections established to the source database server:
The standard/default values I set for my use case are as follows:

- Number of partitions in a batch: 10
- Size of each partition (number of rows processed per partition): 1 million
- Number of batches: decided dynamically from the minimum and maximum values of the auto-increment ID column. Imagine those values are 1 and 230 million (23 crore); the number of batches would then be 23, since 10 million rows are processed in each batch.
Here is the formula: Number of partitions in a batch (10) * number of rows processed in each partition (1 million) = one batch size (10 million)
I kept these properties configurable outside the code and passed them in; the code runs in a loop over the batches. It worked like a charm for all data volumes, be it small (a few KBs or MBs of data), medium (a few GBs), or large (several GBs to TBs).
Cheers.