I have been trying to run spark-submit jobs from Airflow, but the Spark files are on a different cluster (1**.1*.0.21) and Airflow runs on another (1**.1*.0.35). I am looking for a detailed explanation of this topic with examples. I can't copy or download any XML files or other files to my Airflow cluster.
When I try to use the SSH hook, the DAG fails to load with the error below. I also have many doubts about using SSHOperator and BashOperator.
Broken DAG: [/opt/airflow/dags/s.py] No module named paramiko
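This error usually means the paramiko package, which Airflow's SSH hook imports, is not installed in the Python environment that parses the DAG files. A minimal sketch to check this, assuming it is run with the same interpreter Airflow uses; module_available is a hypothetical helper name, not part of Airflow:

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the top-level module 'name' can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if not module_available("paramiko"):
    # Assumption: pip targets the same environment as this interpreter.
    print("paramiko is missing for", sys.executable)
    print("try: pip install paramiko  (or: pip install 'apache-airflow[ssh]')")
```

If the check reports paramiko as missing, installing it on the Airflow host (scheduler, webserver and workers) should let the DAG import.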
I have set up the connection, and here is my code and procedure:
import airflow
from airflow import DAG
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta

dag = DAG(
    dag_id="spk",
    description='filer',
    schedule_interval='* * * * *',
    start_date=airflow.utils.dates.days_ago(2),
    params={
        'project_source': '/home/afzal',
        'spark_submit': '/usr/hdp/current/spark2-client/bin/spark-submit --principal hdfs-ivory@KDCAUTH.COM --keytab /etc/security/keytabs/hdfs.headless.keytab --master yarn --deploy-mode client airpy.py',
    },
)

templated_bash_command = """
cd {{ params.project_source }}
{{ params.spark_submit }}
"""

t1 = SSHOperator(
    task_id="SSH_task",
    ssh_conn_id='spark_21',
    command=templated_bash_command,
    dag=dag,
)
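One caveat with the two-line command above: if cd fails on the remote host, the shell still goes on to run spark-submit from the wrong directory. A hedged sketch of building the command so that the second step only runs when the first succeeds; build_remote_command is a hypothetical helper, not an Airflow API, and the spark-submit string is shortened here for readability:

```python
import shlex


def build_remote_command(project_source, spark_submit):
    """Join the two steps with '&&' so spark-submit is skipped if 'cd' fails."""
    # shlex.quote protects directory names that contain spaces or shell characters.
    return "cd {} && {}".format(shlex.quote(project_source), spark_submit)


command = build_remote_command(
    "/home/afzal",
    "/usr/hdp/current/spark2-client/bin/spark-submit --master yarn --deploy-mode client airpy.py",
)
print(command)
```

The resulting string could then be passed as the command argument of the SSHOperator in place of the templated version.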
I also created a connection under 'Admin > Connections' in Airflow:
Conn Id : spark_21
Conn Type : SSH
Host : mas****p
Username : afzal
Password : *****
Port :
Extra :
The username and password are the ones used to log in to the target cluster.
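For reference, the same connection can also be supplied as a URI through an environment variable named AIRFLOW_CONN_SPARK_21 instead of the UI. A minimal sketch of building that URI, assuming the default SSH port 22 since the Port field is empty; ssh_conn_uri and the host and password values are hypothetical placeholders, not the real ones:

```python
from urllib.parse import quote


def ssh_conn_uri(host, login, password, port=22):
    """Build an ssh:// connection URI, percent-encoding the login and password."""
    return "ssh://{}:{}@{}:{}".format(
        quote(login, safe=""),
        quote(password, safe=""),
        host,
        port,
    )


# Placeholder values for illustration only.
uri = ssh_conn_uri("spark-host.example", "afzal", "p@ss/word")
print("export AIRFLOW_CONN_SPARK_21='{}'".format(uri))
```

Percent-encoding matters when the password contains characters such as @ or /, which would otherwise be read as URI delimiters.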