Tags: apache-spark, pyspark, databricks, ojdbc

SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 31 exceeds max precision 11


I want to read an Oracle table with 35M rows and about a hundred columns.

# Partitioned JDBC read of the Oracle source table
extractDF = spark.read \
    .format("jdbc") \
    .option("dbtable", f"""(
        select tt.* from (
            select {ORIGIN_COLUMNS} from {ORIGIN_SCHEMA}.{ORIGIN_TABLE} t
        ) tt
        where tt.{ORIGIN_REF_COLUMN} <= {int(origin_ref_column_value_max)}
    )""") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("url", ORACLE_URL) \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("fetchsize", f"{CONTROL_FETCH_SIZE}") \
    .option("batchsize", f"{CONTROL_BATCH_SIZE}") \
    .option("partitionColumn", ORIGIN_REF_COLUMN) \
    .option("numPartitions", f"{numPartitions}") \
    .option("lowerBound", f"{int(origin_ref_column_value_min)}") \
    .option("upperBound", f"{int(origin_ref_column_value_max)+1}") \
    .load()

extractDF.collect()  # --> Raises SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 31 exceeds max precision 11

I found mentions of this error on the web for cases where the precision exceeds 38 (Spark's maximum), but nothing about a precision below that.

The table schema has several decimal columns like these:

root
 |-- COD_1: decimal(10,0) (nullable = true)
 |-- COD_2: decimal(11,2) (nullable = true)
 |-- COD_3: decimal(30,2) (nullable = true)
 |-- COD_4: decimal(7,2) (nullable = true)

Following some tips I found while researching, I changed all the decimal data types to DecimalType(38, original_scale) using the code below:

import re
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType
# Re-cast every decimal column to DecimalType(38, original_scale)
decimalCols = [c for c in extractDF.dtypes if "decimal" in c[1]]
for name, dtype in decimalCols:
    colScale = re.sub(r"decimal\(\d+,(\d+)\)", r"\1", dtype)
    extractDF = extractDF.withColumn(name, col(name).cast(DecimalType(38, int(colScale))))

But even after changing all the data types, I got the same precision error on extractDF.collect().
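One detail worth noting: the cast above only runs after the JDBC reader has already decoded each row with the inferred decimal(p,s) types, so it cannot prevent an error raised during the fetch itself. The JDBC source also has a customSchema read option that overrides the type mapping for the listed columns before rows are decoded. A minimal sketch, reusing the connection settings above (the column list is illustrative, taken from the printed schema, and this is an idea rather than something I verified):

# Sketch: widen the decimal mapping at read time with customSchema instead of
# casting after load. Columns not listed keep the default JDBC type mapping.
extractDF = spark.read \
    .format("jdbc") \
    .option("url", ORACLE_URL) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("dbtable", f"{ORIGIN_SCHEMA}.{ORIGIN_TABLE}") \
    .option("customSchema",
            "COD_1 DECIMAL(38,0), COD_2 DECIMAL(38,2), "
            "COD_3 DECIMAL(38,2), COD_4 DECIMAL(38,2)") \
    .load()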

I tried to find some weird data in the table, but I did not find anything.


Spark 3.3.2, Scala 2.12, ojdbc8 23.2.0.0

Another curious thing about it:

I added rownum to the SQL to track down the 'corrupted' data (a sketch of one such probe is shown after the list):

  • query where rownum < 42k  -> works
  • query where rownum < 44k  -> error
  • query where rownum between 42k and 44k -> works (an error should have been raised here if the data in that range were the problem)
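Each probe is just the same JDBC read wrapped in a rownum window. A sketch of how such a probe can be scripted (probe_window is a made-up helper name, the window is applied to the plain table for brevity, and the bounds mirror the 42k/44k values above):

# Hypothetical helper: read only a rownum window of the source table and collect
# it, so a bad row inside the window surfaces as the same precision error.
def probe_window(lo, hi):
    window_query = f"""(
        select * from (
            select t.*, rownum rn from {ORIGIN_SCHEMA}.{ORIGIN_TABLE} t
        ) where rn >= {lo} and rn < {hi}
    )"""
    rows = spark.read \
        .format("jdbc") \
        .option("url", ORACLE_URL) \
        .option("driver", "oracle.jdbc.driver.OracleDriver") \
        .option("user", ORACLE_USER) \
        .option("password", ORACLE_PASSWORD) \
        .option("dbtable", window_query) \
        .load() \
        .collect()
    return len(rows)

probe_window(0, 42_000)        # works
probe_window(0, 44_000)        # raises the precision error
probe_window(42_000, 44_000)   # also works, which is the confusing part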

I don't know if this is a Spark bug, an ojdbc bug, a bug in my own head, or some divine intervention... Can you help me?


Solution

  • The Oracle table had corrupted data: one column declared as number(11,0) contained the value -10100000000000000000000000000000.

    We were not even able to query for this value directly; the only way to find the row was with the inequality operator (<>). A sketch of one way to hunt for this kind of out-of-range value is shown below.

    This problem happened only in the replication database, so be warned about it.
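One way to hunt for this kind of out-of-range value is to render the suspect column as text on the Oracle side, so Spark never has to map it to the too-narrow decimal(11,0). A minimal sketch (COD_X is a placeholder for the real number(11,0) column, and the 11-digit threshold matches its declared precision):

# Hypothetical probe: convert the suspect NUMBER(11,0) column to text in Oracle
# and keep only values whose digit count exceeds the declared precision.
# Reading it back as a string avoids the decimal(11,0) mapping entirely.
hunt_query = f"""(
    select {ORIGIN_REF_COLUMN}, to_char(COD_X) as cod_x_txt
    from {ORIGIN_SCHEMA}.{ORIGIN_TABLE}
    where length(to_char(abs(COD_X))) > 11
)"""

badRowsDF = spark.read \
    .format("jdbc") \
    .option("url", ORACLE_URL) \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .option("user", ORACLE_USER) \
    .option("password", ORACLE_PASSWORD) \
    .option("dbtable", hunt_query) \
    .load()

badRowsDF.show(truncate=False)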