apache-spark, hdfs, hadoop-yarn, presto, hdp

Installing Presto on a VM cluster and connecting it to HDFS on a different Yarn cluster


We have an HDP 2.6.4 Spark cluster with 10 Linux worker machines.

The cluster runs Spark applications over HDFS, which is installed on all the workers.

We wish to install Presto to query the cluster's HDFS; however, due to a lack of CPU resources on the worker machines (only 32 cores per machine), the plan is to install Presto outside of the cluster.

For that purpose we have several ESX hosts; each ESX host will run 2 VMs, and each VM will run a single Presto server.

All the ESX machines will be connected to the Spark cluster via 10G network cards, so the two clusters will be on the same network.

My question is - can we install Presto on the VM cluster even though the HDFS is not on the ESX cluster (but on the Spark cluster instead)?

EDIT:

From the answer we got it seems that installing Presto on VMs is standard practice, so I'd like to clarify my question:

Presto has a configuration file named hive.properties under presto/etc.

Inside that file there’s a parameter named hive.config.resources with the following value:

/etc/hadoop/conf/presto-hdfs-site.xml,/etc/hadoop/conf/presto-core-site.xml

These files are HDFS config files, but the VM cluster and the Spark cluster (which hosts the HDFS) are separate clusters, and the Presto servers on the VM cluster should access the HDFS that resides on the Spark cluster. So the question is:

Should these files be copied from the Spark cluster to the VM cluster?
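
For illustration, the essential entry such a copied presto-core-site.xml would carry is fs.defaultFS pointing at the NameNode of the Spark cluster (the hostname and port below are placeholders, not values from our environment):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.spark-cluster.example:8020</value>
      </property>
    </configuration>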


Solution

  • Regarding your question - can we install Presto on the VM cluster even though the HDFS is not on the ESX cluster (but on the Spark cluster instead)?

    The answer is YES

    On this cluster, which isn't co-hosted with HDFS, don't forget to set the following parameter in hive.properties:

    hive.force-local-scheduling=false
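
    Putting the pieces together, and assuming the two XML files referenced by hive.config.resources are copies taken from the Spark cluster's Hadoop configuration, a Hive catalog file on the VM cluster might look roughly like this (the metastore host below is a placeholder):

    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://metastore.spark-cluster.example:9083
    hive.config.resources=/etc/hadoop/conf/presto-hdfs-site.xml,/etc/hadoop/conf/presto-core-site.xml
    hive.force-local-scheduling=false

    With hive.force-local-scheduling=false, Presto does not try to schedule splits on the nodes that host the HDFS blocks, which matches this setup since the Presto VMs and the DataNodes are separate machines.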