google-cloud-storage presto orc google-compute-engine

Presto on Preemptible GCE instances


I am running an instance group of 20 preemptible GCE instances to read ORC files on Google Cloud Storage. The data is partitioned by hour, and each hour is about 2GB.

  1. What type of instances should I use?
  2. How much of the RAM should be allocated to the JVM?
  3. I am using an autoscaling configuration of 80% CPU and a 10-minute cooldown. Is there a more subtle configuration for Presto?
  4. Is there a solution for server shutdowns due to lack of resources?

Partial responses will be appreciated as well.


Solution

  • As of version 0.199 of PrestoDB, there is no Google Cloud Storage connector for Presto, which makes it impossible to query GCS data.

    Regarding hardware requirements, I'll cite the Teradata docs here.

    Memory

    You should allocate a minimum of 16GB of RAM per node for Presto. However, 64GB is recommended for most production workloads.

    Network Bandwidth

    It is recommended to have 10 Gigabit Ethernet between all the nodes in the cluster.

    Other Recommendations

    Presto can be installed on any normally configured Hadoop cluster. YARN should be configured to account for resources dedicated to Presto. For example, if a node has 64GB of RAM, perhaps you would normally allocate 60GB to YARN. If you install Presto on that node and give Presto 32GB of RAM, then you should subtract 32GB from the 60GB and let YARN only allocate 28GB per node. An optimized configuration might choose to have separate Presto and Hadoop nodes. The optimized configuration allows you to give more memory to Presto, and thus perform larger join queries, for example.
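The YARN arithmetic in the quoted recommendation can be sketched as a small calculation. This is only an illustration of the example figures above (64GB node, 60GB normally given to YARN, 32GB given to Presto); the function name `yarn_memory_after_presto` is hypothetical, and the conversion to megabytes assumes the result would be fed into YARN's `yarn.nodemanager.resource.memory-mb` setting.

```python
def yarn_memory_after_presto(yarn_gb: int, presto_gb: int) -> int:
    """Return the memory (in GB) YARN may still allocate on a node
    once Presto's share has been subtracted from YARN's usual budget."""
    if presto_gb > yarn_gb:
        raise ValueError("Presto's share exceeds the memory budgeted for YARN")
    return yarn_gb - presto_gb


# Figures taken from the quoted example.
node_ram_gb = 64      # total RAM on the node
yarn_before_gb = 60   # what YARN would normally be allowed to allocate
presto_gb = 32        # RAM dedicated to Presto on the same node

yarn_after_gb = yarn_memory_after_presto(yarn_before_gb, presto_gb)
yarn_after_mb = yarn_after_gb * 1024  # assumed unit for the YARN setting

print(f"Node RAM: {node_ram_gb}GB")
print(f"YARN may allocate: {yarn_after_gb}GB ({yarn_after_mb} MB)")
```

With the quoted numbers this leaves 28GB per node for YARN. On dedicated Presto nodes, as in the "optimized configuration" the quote mentions, no subtraction is needed because YARN does not share the machine.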