I am trying to create a Dataproc cluster with Presto as an optional component, and I would like to add a Kafka catalog to it. Following https://cloud.google.com/dataproc/docs/concepts/components/presto and https://prestodb.io/docs/current/connector/kafka.html#configuration-properties I am using the following command:
gcloud beta dataproc clusters create mycluster \
--region us-central1 \
--no-address \
--zone us-central1-a \
--single-node \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--project myproject \
--optional-components=PRESTO \
--enable-component-gateway \
--properties="presto-catalog:kafkastream.connector.name=kafka,presto-catalog:kafkastream.kafka.table-names=topicname,presto-catalog:kafkastream.kafka.nodes=kafkavm:9092,presto-catalog:kafkastream.kafka.default-schema=default,presto-catalog:kafkastream.kafka.hide-internal-columns=false"
So, basically, I want to set the properties to install a catalog called kafkastream that connects to a Kafka VM on port 9092 and creates a table default.topicname.
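My reading of the Dataproc docs (not verified against the service itself) is that each `presto-catalog:FILE.KEY=VALUE` entry should become the line `KEY=VALUE` in `/etc/presto/conf/catalog/FILE.properties`. A minimal sketch of that mapping, with the target directory taken as a parameter so it can be tried locally:

```shell
#!/bin/bash
# Sketch (assumed semantics): expand Dataproc-style
# "presto-catalog:FILE.KEY=VALUE" entries into per-catalog property files.
render_catalog_properties() {
  local outdir=$1; shift
  local entry file kv
  for entry in "$@"; do
    entry=${entry#presto-catalog:}   # drop the "presto-catalog:" prefix
    file=${entry%%.*}                # catalog file name: text before the first dot
    kv=${entry#*.}                   # remaining "key=value" line
    printf '%s\n' "$kv" >> "$outdir/$file.properties"
  done
}
```

Run against the entries above, this would produce a `kafkastream.properties` containing `connector.name=kafka`, `kafka.nodes=kafkavm:9092`, and so on.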
However, when I try to create the cluster, its status goes to ERROR. In the log I found StructuredError{presto, Component presto failed to activate}.
Other errors in the log are:
google-dataproc-startup[1129]: activate-component-presto[2447]: Query 12345 failed: Presto server is still initializing
google-dataproc-startup[1129]: activate-component-presto[2447]: 'get_node_information' attempt 6 failed! Sleeping 10s.
google-dataproc-startup[1129]: activate-component-presto[2447]: Error running command: java.net.ConnectException: Failed to connect to localhost/0:0:0:0:0:0:0:1:8060
If I remove the --properties flag, Presto works perfectly.
What is the right way to set up a Kafka catalog? Could someone help me? I cannot find information related to this question either in other Stack Overflow topics or elsewhere online.
The --properties feature for the optional component Presto seems to have a bug and does not work as expected. However, I have found a way to set up a Kafka catalog via an initialisation script, init-script.sh, stored in a GCS bucket:
#!/bin/bash
# init-script.sh

function add_kafka_catalog() {
  # Presto reads every *.properties file in this directory as a catalog
  cat > /etc/presto/conf/catalog/kafka.properties <<EOF
connector.name=kafka
kafka.nodes=my-vm:9092
kafka.table-names=my-topic
kafka.hide-internal-columns=false
EOF
}

# Restart Presto so it picks up the new catalog
function restart_presto() {
  sudo /usr/lib/presto/bin/launcher restart
}

function main() {
  add_kafka_catalog
  restart_presto
}

main
and launching the cluster via
gcloud beta dataproc clusters create mycluster \
--region us-central1 \
--no-address \
--zone us-central1-a \
--single-node \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 500 \
--project myproject \
--optional-components=PRESTO \
--enable-component-gateway \
--initialization-actions 'gs://mybucket/init-script.sh'
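Once the cluster is up, you can sanity-check that the catalog file landed where Presto expects it. A small sketch of such a check (on the Dataproc master the directory is /etc/presto/conf/catalog; the environment-variable override exists only so the check can be exercised outside the cluster):

```shell
#!/bin/bash
# Sketch: verify a Presto catalog file has the keys the Kafka connector needs.
CATALOG_DIR="${CATALOG_DIR:-/etc/presto/conf/catalog}"

check_catalog() {
  local name=$1
  local file="$CATALOG_DIR/$name.properties"
  local key
  for key in connector.name kafka.nodes kafka.table-names; do
    if ! grep -q "^${key}=" "$file"; then
      echo "missing ${key} in ${file}"
      return 1
    fi
  done
  echo "catalog ${name} looks complete"
}
```

For example, `check_catalog kafka` on the master, or query the Presto CLI there with `presto --execute 'SHOW CATALOGS'` to confirm the catalog is registered.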