I want to read an Oracle table with 35M rows and about a hundred columns.
extractDF = spark.read \
    .format("jdbc") \
    .option("dbtable", f"""(
        select tt.* from (
            select {ORIGIN_COLUMNS} from {ORIGIN_SCHEMA}.{ORIGIN_TABLE} t
        ) tt
        where tt.{ORIGIN_REF_COLUMN} <= {int(origin_ref_column_value_max)}
    )""") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("url", ORACLE_URL) \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("fetchsize", f"{CONTROL_FETCH_SIZE}") \
    .option("batchsize", f"{CONTROL_BATCH_SIZE}") \
    .option("partitionColumn", ORIGIN_REF_COLUMN) \
    .option("numPartitions", f"{numPartitions}") \
    .option("lowerBound", f"{int(origin_ref_column_value_min)}") \
    .option("upperBound", f"{int(origin_ref_column_value_max)+1}") \
    .load()
extractDF.collect()  # --> raises SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 31 exceeds max precision 11
I found mentions of this error on the web for cases where the precision goes above 38, but nothing about a value below that limit.
The table schema has several columns like these:
root
|-- COD_1: decimal(10,0) (nullable = true)
|-- COD_2: decimal(11,2) (nullable = true)
|-- COD_3: decimal(30,2) (nullable = true)
|-- COD_4: decimal(7,2) (nullable = true)
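For reference, the JDBC source also has a customSchema option that overrides the types Spark derives from the Oracle metadata, before any rows are decoded. A sketch (untested against this table; the DECIMAL(38, x) targets and the choice of COD_2/COD_3 are just guesses):

# Sketch (untested): force wider decimal types before the rows are decoded.
# Only the listed columns are overridden; the rest keep the inferred mapping.
wideDF = spark.read \
    .format("jdbc") \
    .option("url", ORACLE_URL) \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", f"{ORIGIN_SCHEMA}.{ORIGIN_TABLE}") \
    .option("customSchema", "COD_2 DECIMAL(38,2), COD_3 DECIMAL(38,2)") \
    .load()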
Following some research tips I found, I changed all decimal datatypes to DecimalType(38, original_scale) using the code below:
import re
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

# Every decimal column as a (name, dtype) pair, e.g. ('COD_3', 'decimal(30,2)')
decimalCols = [it for it in extractDF.dtypes if 'decimal' in it[1]]
for c in decimalCols:
    # Keep the original scale, but widen the precision to the maximum (38)
    colScale = re.sub(r"decimal\(\d+,(\d+)\)", r"\1", c[1])
    extractDF = extractDF.withColumn(c[0], col(c[0]).cast(DecimalType(38, int(colScale))))
But even after changing all the datatypes, I still got the same precision error on extractDF.collect().
I tried to find some weird data in the table, but I did not find anything.
Spark 3.3.2, Scala 2.12, ojdbc8 23.2.0.0
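Since the error seems to be raised while the JDBC rows are being decoded, i.e. before the withColumn cast ever runs, another idea is to do the conversion inside the subquery that Oracle executes. Just a sketch, with COD_2 used purely as an example of a suspect column:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType

# Sketch (untested): let Oracle serialize the suspect column as text, so Spark
# never has to build a Decimal(11,2) for it while reading the ResultSet.
safe_query = f"""(
    select t.COD_1,
           to_char(t.COD_2) as COD_2,
           t.COD_3,
           t.COD_4
    from {ORIGIN_SCHEMA}.{ORIGIN_TABLE} t
    where t.{ORIGIN_REF_COLUMN} <= {int(origin_ref_column_value_max)}
)"""

safeDF = spark.read \
    .format("jdbc") \
    .option("url", ORACLE_URL) \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", safe_query) \
    .load()

# Cast back only after the data has been loaded as text.
safeDF = safeDF.withColumn("COD_2", col("COD_2").cast(DecimalType(38, 2)))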
Another curious thing about it:
I added rownum to the SQL to try to track down the 'corrupted' rows.
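A sketch of what that subquery could look like (illustrative only, not the exact SQL from the job):

# Wrap the original projection and number the rows, so a failing batch
# can be narrowed down by rn ranges.
tracking_query = f"""(
    select rownum as rn, tt.*
    from (
        select {ORIGIN_COLUMNS} from {ORIGIN_SCHEMA}.{ORIGIN_TABLE} t
    ) tt
    where tt.{ORIGIN_REF_COLUMN} <= {int(origin_ref_column_value_max)}
)"""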
I don't know if it is a Spark bug, an ojdbc bug, a bug in my head, or some divine intervention... Can you help me?
It turned out the Oracle table had corrupted data: one NUMBER(11,0) column held the value -10100000000000000000000000000000.
We were not even able to query for this number directly; the only way to reach the row was with an inequality (<>) filter.
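For anyone hunting for the same kind of corruption, here is a sketch of a probe that looks for values wider than the declared precision. SUSPECT_COL is a placeholder for the NUMBER(11,0) column, and to_char keeps Spark from having to decode the broken value as a decimal; I can't promise Oracle evaluates these functions cleanly against corrupted storage, and in our case only the <> filter actually surfaced the row:

# Sketch (illustrative): find values that do not fit in NUMBER(11,0).
probe_query = f"""(
    select t.{ORIGIN_REF_COLUMN} as ref_id,
           to_char(t.SUSPECT_COL) as suspect_raw
    from {ORIGIN_SCHEMA}.{ORIGIN_TABLE} t
    where abs(t.SUSPECT_COL) >= power(10, 11)
)"""

badRowsDF = spark.read \
    .format("jdbc") \
    .option("url", ORACLE_URL) \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("dbtable", probe_query) \
    .load()

badRowsDF.show(truncate=False)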
This problem happened only in the replication database, so be warned about it.