apache-spark, pyspark, airflow, remote-server, spark-submit

How to run spark-submit programs on a different cluster (1**.1*.0.21) from Airflow (1**.1*.0.35)? How do I connect to the remote cluster from Airflow?


I have been trying to run spark-submit programs from Airflow, but the Spark files live on a different cluster (1**.1*.0.21) while Airflow runs on (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I cannot copy or download any XML files or other files to my Airflow cluster.

When I try the SSH hook, I get the error below, and I still have doubts about how to use the SSHOperator and BashOperator here.

Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko
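
This import error just means the paramiko library, which Airflow's SSHHook and SSHOperator depend on, is not installed on the Airflow host. Assuming a pip-managed Airflow 1.10 install (the package names below are the standard ones, not taken from the post), something like this resolves it:

    pip install paramiko
    # or pull in all of Airflow's SSH dependencies in one go:
    pip install 'apache-airflow[ssh]'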

Solution

  • I got the connection working; here is my code and procedure.

    import airflow
    from airflow import DAG
    from airflow.contrib.operators.ssh_operator import SSHOperator


    # DAG that SSHes into the remote Spark cluster and runs spark-submit there.
    dag = DAG(dag_id="spk", description='filer',
              schedule_interval='* * * * *',
              start_date=airflow.utils.dates.days_ago(2),
              # project_source: working directory on the remote cluster
              # spark_submit: full spark-submit command to run there (kerberized, YARN client mode)
              params={'project_source': '/home/afzal',
                      'spark_submit': '/usr/hdp/current/spark2-client/bin/spark-submit --principal hdfs-ivory@KDCAUTH.COM --keytab /etc/security/keytabs/hdfs.headless.keytab --master yarn --deploy-mode client airpy.py'})

    # Rendered with the params above and executed on the remote host over SSH.
    templated_bash_command = """
                cd {{ params.project_source }}
                {{ params.spark_submit }}
                """

    # ssh_conn_id points at the 'spark_21' connection defined under Admin > Connections.
    t1 = SSHOperator(
           task_id="SSH_task",
           ssh_conn_id='spark_21',
           command=templated_bash_command,
           dag=dag
           )

    I also created a connection under 'Admin > Connections' in the Airflow UI:

    Conn Id : spark_21
    Conn Type : SSH
    Host : mas****p
    Username : afzal
    Password : *****
    Port :
    Extra :
    

    The username and password are used to log in to the desired cluster.
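
    As an alternative to the UI, Airflow can also pick this connection up from an environment variable named AIRFLOW_CONN_<CONN_ID> (upper-cased) holding a connection URI. This is standard Airflow behaviour rather than part of the original setup, and the host and password below are only placeholders for the masked values above:

        # Same 'spark_21' SSH connection defined as an environment variable on the Airflow host.
        # Substitute the real host and password for the masked/placeholder values.
        export AIRFLOW_CONN_SPARK_21='ssh://afzal:<password>@mas****p'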