apache-spark, pyspark, parquet

How to handle Money data type when writing to Parquet


I've been trying to get data from SQL Server, load it into a DataFrame, and write it to Parquet (which I later load into BigQuery or another destination). I'm having a problem with the money data type. For example, when the data in SQL Server is:

100,000

but after writing to Parquet it comes out as:

100

(Because the data set is large, I can't download it locally to verify, but maybe write.parquet converts money to int; please correct me.)

Here's part of my script:

df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://{myIP}:1433;instanceName={myInstance};database={myDB};") \
    .option("dbtable", table_source) \
    .option("user", user_source) \
    .option("password", password_source) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

df.write.parquet("gs://output/sample.parquet")

Should I specify a schema for each column, or is there a better approach?


Solution

  • I believe this is because the , character is being treated as a decimal point. Can you confirm that the data type of the column in SQL Server is numeric?

    If the type in SQL Server is numeric, you can try manually removing the , and casting to double or string before writing to Parquet. If it's not numeric, you will have to do the casting anyway, roughly as sketched below.
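
    A minimal sketch of that approach, with some assumptions: the column name "amount" is hypothetical (use your actual money column), and decimal(19,4) is chosen because it matches the precision of SQL Server's money type.

    from pyspark.sql import functions as F

    # First check what type Spark actually inferred for the column over JDBC.
    df.printSchema()

    # "amount" is a hypothetical column name; replace it with the real money column.
    # Strip thousands separators, then cast to a numeric type before writing.
    df_fixed = df.withColumn(
        "amount",
        F.regexp_replace(F.col("amount").cast("string"), ",", "").cast("decimal(19,4)")
    )

    df_fixed.write.parquet("gs://output/sample.parquet")

    Casting to string (or double) instead of decimal(19,4) also works if you only need the value preserved as-is rather than as an exact numeric.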