I am really hoping to use Presto in an ETL pipeline on AWS EMR, but I am having trouble configuring it to fully utilize the cluster's resources. This cluster would exist solely for this one query, and nothing more, then die. Thus, I would like to claim the maximum available memory for each node and the one query by increasing query.max-memory-per-node
and query.max-memory
. I can do this when I'm configuring the cluster by adding these settings in the "Edit software settings" box of the cluster creation view in the AWS console. But the Presto server doesn't start, reporting in the server.log file an IllegalArgumentException, saying that max-memory-per-node exceeds the useable heap space (which, by default, is far too small for my instance type and use case).
I have tried to use the session setting set session resource_overcommit=true
, but that only seems to override query.max-memory, not query.max-memory-per-node, because in the Presto UI, I see that very little of the available memory on each node is being used for the query.
Through Google, I've been led to believe that I need to also increase the JVM heap size by changing the -Xmx and -Xms properties in /etc/presto/conf/jvm.config, but it says here (http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto.html) that it is not possible to alter the JVM settings in the cluster creation phase.
To change these properties after the EMR cluster is active and the Presto server has been started, do I really have to manually ssh into each node and alter jvm.config and config.properties, and restart the Presto server? While I realize it'd be possible to manually install Presto with a custom configuration on an EMR cluster through a bootstrap script or something, this would really be a deal-breaker.
Is there something I'm missing here? Is there not an easier way to make Presto allocate all of a cluster to one query?
As advertised, increasing query.max-memory-per-node
, and also by necessity the -Xmx
property, indeed cannot be achieved on EMR until after Presto has already started with the default options. To increase these, the jvm.config and config.properties found in /etc/presto/conf/ have to be changed, and the Presto server restarted on each node (core and coordinator).
One can do this with a bootstrap script using commands like
sudo sed -i "s/query.max-memory-per-node=.*GB/query.max-memory-per-node=20GB/g" /etc/presto/conf/config.properties
sudo restart presto-server
and similarly for /etc/presto/jvm.conf. The only caveats are that one needs to include the logic in the bootstrap action to execute only after Presto has been installed, and that the server on the coordinating node needs to be restarted last (and possibly with different settings if the master node's instance type is different than the core nodes).
You might also need to change resources.reserved-system-memory
from the default by specifying a value for it in config.properties. By default, this value is .4*(Xmx value), which is how much memory is claimed by Presto for the system pool. In my case, I was able to safely decrease this value and give more memory to each node for executing the query.