apache-kafka presto google-cloud-dataproc

Presto in Dataproc: configure a Kafka catalog


I am trying to create a Dataproc cluster with Presto as an optional component, and I would like to add a Kafka catalog to it. Following https://cloud.google.com/dataproc/docs/concepts/components/presto and https://prestodb.io/docs/current/connector/kafka.html#configuration-properties, I am using the following command:

gcloud beta dataproc clusters create mycluster \
    --region us-central1 \
    --no-address \
    --zone us-central1-a \
    --single-node \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 500 \
    --project myproject \
    --optional-components=PRESTO \
    --enable-component-gateway \
    --properties="presto-catalog:kafkastream.connector.name=kafka,presto-catalog:kafkastream.kafka.tables-names=topicname,presto-catalog:kafkastream.kafka.nodes=kafkavm:9092,presto-catalog:kafkastream.kafka.default-schema=default,presto-catalog:kafkastream.kafka.hide-internal-columns=false"
   

Basically, I want to set the properties so that a catalog called kafkastream is installed, which connects to a Kafka VM on port 9092 and creates a table default.topicname.

However, when I try to create the cluster, its status goes to error. In the log I found an entry containing StructuredError{presto, Component presto failed to activate. Other errors in the log are:

google-dataproc-startup[1129]: activate-component-presto[2447]: Query 12345 failed: Presto server is still initializing
google-dataproc-startup[1129]: activate-component-presto[2447]: 'get_node_information' attempt 6 failed! Sleeping 10s.
google-dataproc-startup[1129]: activate-component-presto[2447]: Error running command: java.net.ConnectException: Failed to connect to localhost/0:0:0:0:0:0:0:1:8060

If I remove the --properties part, Presto works perfectly.

What is the right way to set up a Kafka catalog? Could someone help me? I could not find information about this, either in other Stack Overflow questions or elsewhere online.


Solution

  • The --properties feature for the optional Presto component seems to have a bug and does not work as expected. However, I found a way to set up a Kafka catalog with an initialisation script, init-script.sh, stored in a GCS bucket:

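    #!/bin/bash
    # init-script.sh: Dataproc initialization action that adds a Kafka catalog to Presto.
    set -euo pipefail

    # Write the catalog definition; the file name (kafka) becomes the catalog name.
    function add_kafka_catalog() {
      cat > /etc/presto/conf/catalog/kafka.properties <<EOF
    connector.name=kafka
    kafka.nodes=my-vm:9092
    kafka.table-names=my-topic
    kafka.hide-internal-columns=false
    EOF
    }

    # Restart Presto so that it reads the new catalog.
    function restart_presto() {
      sudo /usr/lib/presto/bin/launcher restart
    }

    function main() {
      add_kafka_catalog
      restart_presto
    }

    main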
    

    and then launching the cluster with:

    gcloud beta dataproc clusters create mycluster \
        --region us-central1 \
        --no-address \
        --zone us-central1-a \
        --single-node \
        --master-machine-type n1-standard-4 \
        --master-boot-disk-size 500 \
        --project myproject \
        --optional-components=PRESTO \
        --enable-component-gateway \
        --initialization-actions gs://mybucket/init-script.sh
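
Presto takes a catalog's name from the name of its properties file, so a file called kafka.properties yields a catalog named kafka, not the kafkastream catalog asked about in the question. The following is a minimal sketch of a parameterized variant of the catalog step, assuming the same /etc/presto/conf/catalog directory and the same four connector properties used above. The function name write_kafka_catalog and its arguments are hypothetical, not part of Dataproc or Presto.

```shell
#!/bin/bash
# Sketch: write a Presto Kafka catalog file; the file name sets the catalog name.
# write_kafka_catalog is a hypothetical helper, not part of Dataproc or Presto.
write_kafka_catalog() {
  local catalog_dir="$1"   # e.g. /etc/presto/conf/catalog on the Dataproc master
  local catalog_name="$2"  # <catalog_name>.properties is exposed as this catalog
  local nodes="$3"         # Kafka broker list in host:port form
  local tables="$4"        # comma-separated topic names

  mkdir -p "${catalog_dir}"
  # One property per line, same keys as in init-script.sh above.
  printf '%s\n' \
    "connector.name=kafka" \
    "kafka.nodes=${nodes}" \
    "kafka.table-names=${tables}" \
    "kafka.hide-internal-columns=false" \
    > "${catalog_dir}/${catalog_name}.properties"
}

# Example call from an initialization action (all values are placeholders):
# write_kafka_catalog /etc/presto/conf/catalog kafkastream my-vm:9092 my-topic
```

Called as in the commented example and followed by the same Presto restart, this should make the topic queryable under the kafkastream catalog. A statement such as SHOW TABLES FROM kafkastream.default in the Presto CLI on the master node is one way to check, assuming the connector's default schema name is left unchanged.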