I am trying to convert the sql server tables to .csv format through below code in pyspark.
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://server:port").option("databaseName","database").option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable","table").option("user","uid").option("password","pwd").load()
df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save("full_path")
so, if I want to convert multiple tables, I need to write multiple Data Frames.So, to avoid it, I wanted to take command line argument for the database name, and the table name from the users while iterating on the data frame through for loop.
Is it even possible? If yes, can somebody guide me on how to do it through spark-submit?
Just do this change to spark-submit command and your code:
test.py
import sys
from pyspark import SparkContext
sc = SparkContext("local", "Simple App")
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
db_name = sys.argv[1]
table_name = sys.argv[2]
file_name = sys.argv[3]
df = sqlContext.read.format("jdbc").option("url","jdbc:sqlserver://server:port").option("databaseName",db_name).option("driver","com.microsoft.sqlserver.jdbc.SQLServerDriver").option("dbtable",table_name).option("user","uid").option("password","pwd").load()
df.registerTempTable("test")
df.write.format("com.databricks.spark.csv").save(file_name)
Spark-submit command:
spart-submit test.py <db_name> <table_name> <file_name>