google-cloud-platform, hive-metastore, dataproc, airflow-api, google-cloud-dataproc-metastore

Couldn't connect to DPMS while creating a Dataproc cluster using an Airflow operator


I have a Dataproc Metastore (DPMS) service created in the same project as my Composer environment, and I want to use it instead of my Hive warehouse. I can do this successfully with gcloud commands, but when I use Airflow operators such as DataprocClusterCreateOperator or DataprocCreateClusterOperator, I cannot attach this external DPMS to my Dataproc cluster. The operator accepts arguments like 'dataproc_metastore_service' without any syntax error, but it does not actually use the argument: when the cluster is created, the DPMS field in the cluster configuration is 'none'.

Code I am trying to execute:

create_cluster = DataprocClusterCreateOperator(
    task_id='create_cluster',
    cluster_name=CLUSTER_NAME,
    project_id=PROJECT_ID,
    region='us-east4',
    zone='us-east4-b',
    subnetwork_uri="projects/**************/shared-np-east-green-subnet-2",
    internal_ip_only=True,
    enable_component_gateway=True,
    num_masters=1,
    master_machine_type='n1-standard-4',
    master_disk_size=30,
    num_workers=2,
    worker_machine_type='n1-standard-4',
    worker_disk_size=30,
    init_action_timeout='10m',
    image_version='2.0-rocky8',
    gcp_conn_id='custom_gcp_conn',
    service_account_scopes=['https://www.googleapis.com/auth/cloud-platform'],
    optional_components=['HIVE_WEBHCAT', 'ZOOKEEPER', 'DOCKER'],
    labels={'type': 'eph', 'resourceowner': 'application'},
    service_account="[email protected]",
    properties={
        'dataproc:dataproc.components.deactivate': 'hive-metastore',
        'hive:hive.metastore.warehouse.dir': 'gs://myproject-warehouse/db',
        'dataproc:dataproc.logging.stackdriver.job.driver.enable': 'True',
        'dataproc:dataproc.logging.stackdriver.job.yarn.container.enable': 'True',
        'dataproc:dataproc.logging.stackdriver.enable': 'True',
        'dataproc:jobs.file-backed-output.enable': 'True',
        'dataproc:dataproc.monitoring.stackdriver.enable': 'True',
        'dataproc:metastore-config:dataproc-metastore-service':
            'projects/common_project/locations/us-east4/services/custom_service_name',
    },
    metadata=[
        ("http-proxy", "http://proxy.ebiz.example.com:9290"),
        ("email-smtp-host", "exmp.example.com"),
        ("email-from-address", "[email protected]"),
        ("mysql-root-password-secret-name",
         "mysql-root-password,exmp-password-secret-name=exmp-password"),
    ],
    idle_delete_ttl=300,
    dag=dag,
)

In the above code, I also tried passing dataproc_metastore_service='projects/common_project/locations/us-east4/services/custom_service_name' directly as an operator argument, but that didn't work either. As an alternate approach, I defined it inside the properties argument (as shown above), but that had no effect.

Any thoughts would be appreciated.


Solution

  • I got the answer; posting it here in case it helps others. Instead of passing the metastore as a top-level operator argument, define it in a metastore_config block inside the cluster config:

    CLUSTER_CONFIG = {
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 1024},
        },
        "worker_config": {
            "num_instances": 2,
            "machine_type_uri": "n1-standard-4",
            "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 1024},
        },
        "metastore_config":{
            "dataproc_metastore_service": "projects*************-dpms"