I have a SQL Server table whose schema differs from my DataFrame's. I would like to select some columns from my DataFrame and "insert into" the table only those values.
Basically, something similar to the SQL below, but in PySpark:
INSERT INTO Cust_Diff_Schema_tbl
(acct_num, name)
SELECT account_no, name
FROM customers
WHERE customer_id > 5000;
I can read the data over JDBC with spark.read, like below:
# "dbtable" and "query" are mutually exclusive, so pass only the query;
# Spark wraps it in a subquery itself, so no extra parentheses are needed
df_s3 = spark.read.format("jdbc")\
.option("driver", db_driver_name)\
.option("url", db_url + ":1433;databaseName=" + stage_db)\
.option("query", "select * from customers")\
.option("user", db_username)\
.option("password", db_password)\
.load()
df_s3.printSchema()
df_s3.show(20)
To write/append the data to the table with only the selected values, I believe I can still use "df_s3.write", but I need an example of how to express that insert (the column mapping) through the ".option" function, or another approach if that does not work.
Thanks in advance.
//create the DataFrame first: fetch from a DB, read a file, or any other source
val df = ??? //your source DataFrame, e.g. the customers data read above

df.write.format("jdbc")
  .option("url", "jdbcURL")
  .option("driver", "Driver name")
  .option("dbtable", "tablename")
  .option("user", "username") //credentials, unless they are part of the URL
  .option("password", "password")
  .option("numPartitions", "20")
  .option("batchsize", "10000")
  .option("truncate", "true") //only honored with mode "overwrite"; ignored on append
  .mode("append")
  .save()
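The JDBC writer does not take a custom INSERT statement through ".option"; the usual approach is to shape the DataFrame so its columns match the target table and then append it. Below is a minimal PySpark sketch of that, reusing df_s3 and the connection variables from the question and the table/column names from the SQL at the top (adjust them to your actual schema):

from pyspark.sql.functions import col

# Keep only the rows and columns the target table should receive, and
# rename them so they match the table's column names: Spark's JDBC writer
# builds the INSERT from the DataFrame's column names.
df_to_insert = df_s3.filter(col("customer_id") > 5000)\
    .select(col("account_no").alias("acct_num"), col("name"))

# mode("append") issues plain INSERTs and leaves the existing rows in place.
df_to_insert.write.format("jdbc")\
    .option("driver", db_driver_name)\
    .option("url", db_url + ":1433;databaseName=" + stage_db)\
    .option("dbtable", "Cust_Diff_Schema_tbl")\
    .option("user", db_username)\
    .option("password", db_password)\
    .mode("append")\
    .save()

If customers is large, the customer_id > 5000 filter can also be pushed into the read by putting it in the "query" option, so only the needed rows come across the JDBC connection.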