I've been trying to read data from SQL Server, load it into a DataFrame, and write it out to parquet (which I later load into BigQuery or another destination). I'm having a problem with the money data type. For example, when the data in SQL Server is:
100,000
after writing to parquet it becomes:
100
(Because the dataset is large, I can't download it locally to verify, but perhaps write.parquet is converting money to int — please correct me if I'm wrong.)
Here's the relevant part of my script:
df = spark.read.format("jdbc") \
.option("url", "jdbc:sqlserver://{myIP}:1433;instanceName={myInstance};database={myDB};") \
.option("dbtable", table_source) \
.option("user", user_source) \
.option("password", password_source) \
.option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
.load()
df.write.parquet("gs://output/sample.parquet")
Should I specify a schema for each column, or is there a better approach?
I believe this is because the "," character is being treated as a decimal point. Can you confirm that the column's data type in SQL Server is numeric?
If the type in SQL Server is numeric, you can try manually removing the "," and casting to double or string before writing to parquet. If it's not numeric, you'll have to do the casting anyway.
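To illustrate the hypothesis: if "100,000" (one hundred thousand, with a thousands separator) is parsed with the comma taken as a decimal separator, it becomes 100.000, which truncates to 100 — exactly the value you're seeing. A minimal sketch in plain Python, with the string value assumed:

```python
from decimal import Decimal

raw = "100,000"  # value as displayed, with a thousands separator

# If the comma is misread as a decimal separator, the value collapses to 100.
misread = Decimal(raw.replace(",", "."))
print(int(misread))  # prints 100

# Stripping the thousands separator first preserves the real value.
fixed = Decimal(raw.replace(",", ""))
print(fixed)  # prints 100000
```

So before casting, strip the "," from the string representation (or read the column with an explicit decimal type) and the full value survives the write to parquet.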