Search code examples
oracle-databaseapache-sparkpysparkamazon-redshiftaws-glue

How to resolve the following AWS Glue error while writing to Redshift using Spark: "ORA-01722: invalid number"?


I am trying to read from an Oracle database and write to a Redshift table using PySpark.

# Reading data from Oracle
oracle_df = spark.read \
    .format("jdbc") \
    .option("url", oracle_url) \
    .option("dbtable", oracle_table) \
    .option("user", oracle_user) \
    .option("password", oracle_password) \
    .option("driver", oracle_driver_name) \
    .load()

Selecting only a few columns:

selected_df = oracle_df.select("Col1", "Col2")

Printing schema:

selected_df.printSchema()

|-- Col1: decimal(38,10) (nullable = true)
|-- Col2: decimal(38,10) (nullable = true)

I have converted all the columns to string:

for col_name in selected_df.schema.names:
    selected_df = selected_df.withColumn(
        col_name, 
        coalesce(selected_df[col_name].cast(StringType()), lit(""))
    )

 |-- Col1: string (nullable = false)
 |-- Col2: string (nullable = false)

When I write the Dataframe to Redshift, I get the error o323.save. ORA-01722: invalid number.

# Redshift Connection Configuration
redshift_connection_name = 'redshift_connection'
redshift_connection_options = glueContext.extract_jdbc_conf(redshift_connection_name)
redshift_url = redshift_connection_options["fullUrl"]
redshift_user = redshift_connection_options["user"]
redshift_password = redshift_connection_options["password"]
redshift_driver_name = "com.amazon.redshift.jdbc42.Driver"
redshift_table = "test_schema.main_last"

# Writing data to Redshift
selected_df.write \
    .format("jdbc") \
    .option("driver", redshift_driver_name) \
    .option("url", redshift_url) \
    .option("dbtable", redshift_table) \
    .option("user", redshift_user) \
    .option("password", redshift_password) \
    .option("tempdir", "s3://glue-temp/") \
    .mode("overwrite") \
    .save()

I have made sure the Redshift DDL:

CREATE TABLE test_schema.main_last (
    Col1 character varying(256) ENCODE lzo,
    Col2 character varying(256) ENCODE lzo,

Solution

  • ORA-01722: invalid number is ORACLE error caused by failed attempted conversion to number either explicite or implicite. What you do after (outside Oracle), in this case, is irrelevant. Your Oracle's Select command returned this error.

    Where the error comes from:

    # Reading data from Oracle
    oracle_df = spark.read \
        .format("jdbc") \
        .option("url", oracle_url) \
        .option("dbtable", oracle_table) \
        .option("user", oracle_user) \
        .option("password", oracle_password) \
        .option("driver", oracle_driver_name) \
        .load()
    

    The above code runs "Select [column list] From oracle_table" statement in Oracle database and that statement errors as ORA-01722 Invalid number.

    Possible solution:
    Since the error is raised by reading data from Oracle, make sure you control what you are reading and how. Is it possible that you are loading the data from an Oracle view, not a table, it looks just like there is an error in the view definition and/or view data that you are reading from. If that is the case then it will be more complicated to solve this - unless you have access to the view.

    Use query option instead of dbtable.

    # Reading data from Oracle
    # Use query option instead of dbtable
    #
    # query_0 test query to be sure that the logic works
    query_0 = "select 'A' as COL1, 1 as COL2 from dual"
    #
    # query_1 reads original data
    query_1 = "select COL1, COL2 from oracle_table"
    #
    # query_2 converts original data to char on Oracle db side
    query_2 = "select To_Char(COL1) as COL1, To_Char(COL2) as COL2 from oracle_table"
    
    oracle_df = spark.read \
        .format("jdbc") \
        .option("url", oracle_url) \
        .option("query", query_2)\
        .option("user", oracle_user) \
        .option("password", oracle_password) \
        .option("driver", oracle_driver_name) \
        .load()
    

    The goal is either to isolate "corrupted" data or to build a workaround (with Oracle's TO_CHAR() function) and get your data loaded.

    Start with query_0 just to make sure that the structure, driver, connection and the logic works ok.

    If you want to find out which column (if it is a column) throws the error, start querying with just one column with original data and add the others (one by one) till it errors (query_1) meaning the column added last is corrupted. If all particular columns raise the same error then you are, probably, reading a faulty view.

    If you want to load the columns data as strings then declare your query doing the conversion on the Oracle database side (query_2). Converted values should be ok, just take care of columns that could be of DATE or TIMESTAMP datatype in oracle. If this fails too, again, you are, probably, reading a faulty view.

    Handle fetched data after they were successfuly loaded.

    If you are still getting the same error then the problem is not in the data selected but, most likely, you are reading an Oracle view (almost sure it's a view) with an error not in selection list (the data) of the view - it is in some other part of Select statement such as Where or Having or somewhere else. To solve this someone should correct the view definition on Oracle side, test it and make sure it works ok.

    Here are just a few common examples of Select commands ending with ORA-01722 error

    -- depending on your data - explicite conversion
    Select To_Number('a23') "COL_NAME" From Dual;
    -- ORA-01722: invalid number
    
    Select Cast('a23' as Number) "COL_NAME" From Dual;
    -- ORA-01722: invalid number
    
    -- depending on your data and/or data manipulation - implicite conversion
    Select 'a23' * 3 "COL_NAME" From Dual;
    -- ORA-01722: invalid number
    
    -- depending on your db settings of NLS parameters - if decimal point is comma, but dot is used when selecting the data
    Select '2.3' * 3 "COL_NAME" From Dual;
    -- ORA-01722: invalid number
    

    Dealing with DATE datatype and date manipulations or date math could also couse this error

    FAULTY VIEW SAMPLE 1 ( COL1 data ) :

    • query_1 from above - select columns one by one till it errors
    • last column added ( COL1 ) is corrupted but you can fetch all the others
    Create Table TEST_TBL AS 
        ( Select 'A' as A_LETTER, 1 as A_NUMBER, To_Date('01.01.2024', 'dd.mm.yyyy') as A_DATE From Dual Union All
           Select 'B', 2, To_Date('02.01.2024', 'dd.mm.yyyy') From Dual Union All
           Select 'C', 3, To_Date('03.01.2024', 'dd.mm.yyyy') From Dual
         ) ;
    -- Table TEST_TBL created.
         
    Create View TEST_VIEW AS
    Select A_LETTER, A_NUMBER, A_DATE, 
           A_LETTER + A_NUMBER as COL1    -- expresion when run against the data will raise the error
    From   TEST_TBL;
    -- View TEST_VIEW created.
    

    When you select any column except COL1 there will be no error

    Select A_LETTER From TEST_VIEW;
    /*
    A_LETTER
    ---------
    A
    B
    C         */
    
    Select A_LETTER, A_NUMBER, A_DATE From TEST_VIEW;
    /*
    A_LETTER   A_NUMBER A_DATE  
    -------- ---------- --------
    A                 1 01.01.24
    B                 2 02.01.24
    C                 3 03.01.24  */
    

    ... when COL1 is selected the error is raised ...

    Select A_LETTER, A_NUMBER, A_DATE, COL1 From TEST_VIEW;
    -- ORA-01722: invalid number
    

    FAULTY VIEW SAMPLE 2 ( Where clause ) :

    • query_1 and query_2 both raise error for every column selected
    • fault is not in the data selection list (it is in Where clause)
    • you can not fetch anything without intervention in view's code or maybe in underlaying table(s) data if the data are the issue
    Create View TEST_VIEW AS
    Select A_LETTER, A_NUMBER, A_DATE  
    From   TEST_TBL
    Where  A_LETTER + A_NUMBER > 1;
    -- View TEST_VIEW created.
    

    Selecting any column from such a view will raise the error.

    Select A_LETTER From TEST_VIEW;
    -- ORA-01722: invalid number
    

    FAULTY VIEW SAMPLE 3 - with data issue ( in column A_LETTER ) :

    • it could be usefull to know the data from underlaying table(s)
    • the row(s) and or column(s) with the issue could be excluded
    • all other data could be fetched

    If the underlaying table data is like below:

    Select * From TEST_TBL;
    /*
    A_LETTER A_NUMBBER A_DATE
    -------- --------- ----------
    11               1  01.01.24
    12               2  02.01.24
    13C              3  03.01.24    */
    

    ... here A_LETTER column has values that could be implicitely converted to number (except the last row).
    If the view is defined like this:

    Create View TEST_VIEW AS
    Select A_LETTER, A_NUMBER, A_DATE, A_LETTER + A_NUMBER as COL1  
    From   TEST_TBL
    Where  A_LETTER + A_NUMBER > 1;
    -- View TEST_VIEW created.
    

    ... and if you try to select all the data - an error will be raised:

    Select  A_LETTER, A_NUMBER, A_DATE, COL1
    From    TEST_VIEW
    -- ORA-01722: invalid number
    

    ... but if you know/suspect that column A_LETTER could have some incorrect data - you can exclude such rows and fetch all the remaining data (using ... Default value On Conversion error when attempting to convert char value into number)

    Select  A_LETTER, A_NUMBER, A_DATE, COL1
    From    TEST_VIEW
    Where   To_Number(A_LETTER Default 0 ON Conversion Error) != 0
    /*
    A_LETTER   A_NUMBER A_DATE         COL1
    -------- ---------- -------- ----------
    11                1 01.01.24         12
    12                2 02.01.24         14    */