I am trying to read from an Oracle
database and write to a Redshift
table using PySpark
.
# Reading data from Oracle
oracle_df = spark.read \
.format("jdbc") \
.option("url", oracle_url) \
.option("dbtable", oracle_table) \
.option("user", oracle_user) \
.option("password", oracle_password) \
.option("driver", oracle_driver_name) \
.load()
Selecting only a few columns:
selected_df = oracle_df.select("Col1", "Col2")
Printing schema:
selected_df.printSchema()
|-- Col1: decimal(38,10) (nullable = true)
|-- Col2: decimal(38,10) (nullable = true)
I have converted all the columns to string:
for col_name in selected_df.schema.names:
selected_df = selected_df.withColumn(
col_name,
coalesce(selected_df[col_name].cast(StringType()), lit(""))
)
|-- Col1: string (nullable = false)
|-- Col2: string (nullable = false)
When I write the Dataframe to Redshift, I get the error o323.save. ORA-01722: invalid number
.
# Redshift Connection Configuration
redshift_connection_name = 'redshift_connection'
redshift_connection_options = glueContext.extract_jdbc_conf(redshift_connection_name)
redshift_url = redshift_connection_options["fullUrl"]
redshift_user = redshift_connection_options["user"]
redshift_password = redshift_connection_options["password"]
redshift_driver_name = "com.amazon.redshift.jdbc42.Driver"
redshift_table = "test_schema.main_last"
# Writing data to Redshift
selected_df.write \
.format("jdbc") \
.option("driver", redshift_driver_name) \
.option("url", redshift_url) \
.option("dbtable", redshift_table) \
.option("user", redshift_user) \
.option("password", redshift_password) \
.option("tempdir", "s3://glue-temp/") \
.mode("overwrite") \
.save()
I have made sure the Redshift DDL:
CREATE TABLE test_schema.main_last (
Col1 character varying(256) ENCODE lzo,
Col2 character varying(256) ENCODE lzo,
ORA-01722: invalid number is ORACLE error caused by failed attempted conversion to number either explicite or implicite. What you do after (outside Oracle), in this case, is irrelevant. Your Oracle's Select command returned this error.
Where the error comes from:
# Reading data from Oracle
oracle_df = spark.read \
.format("jdbc") \
.option("url", oracle_url) \
.option("dbtable", oracle_table) \
.option("user", oracle_user) \
.option("password", oracle_password) \
.option("driver", oracle_driver_name) \
.load()
The above code runs "Select [column list] From oracle_table" statement in Oracle database and that statement errors as ORA-01722 Invalid number.
Possible solution:
Since the error is raised by reading data from Oracle, make sure you control what you are reading and how. Is it possible that you are loading the data from an Oracle view, not a table, it looks just like there is an error in the view definition and/or view data that you are reading from. If that is the case then it will be more complicated to solve this - unless you have access to the view.
Use query option instead of dbtable.
# Reading data from Oracle
# Use query option instead of dbtable
#
# query_0 test query to be sure that the logic works
query_0 = "select 'A' as COL1, 1 as COL2 from dual"
#
# query_1 reads original data
query_1 = "select COL1, COL2 from oracle_table"
#
# query_2 converts original data to char on Oracle db side
query_2 = "select To_Char(COL1) as COL1, To_Char(COL2) as COL2 from oracle_table"
oracle_df = spark.read \
.format("jdbc") \
.option("url", oracle_url) \
.option("query", query_2)\
.option("user", oracle_user) \
.option("password", oracle_password) \
.option("driver", oracle_driver_name) \
.load()
The goal is either to isolate "corrupted" data or to build a workaround (with Oracle's TO_CHAR() function) and get your data loaded.
Start with query_0 just to make sure that the structure, driver, connection and the logic works ok.
If you want to find out which column (if it is a column) throws the error, start querying with just one column with original data and add the others (one by one) till it errors (query_1) meaning the column added last is corrupted. If all particular columns raise the same error then you are, probably, reading a faulty view.
If you want to load the columns data as strings then declare your query doing the conversion on the Oracle database side (query_2). Converted values should be ok, just take care of columns that could be of DATE or TIMESTAMP datatype in oracle. If this fails too, again, you are, probably, reading a faulty view.
Handle fetched data after they were successfuly loaded.
If you are still getting the same error then the problem is not in the data selected but, most likely, you are reading an Oracle view (almost sure it's a view) with an error not in selection list (the data) of the view - it is in some other part of Select statement such as Where or Having or somewhere else. To solve this someone should correct the view definition on Oracle side, test it and make sure it works ok.
Here are just a few common examples of Select commands ending with ORA-01722 error
-- depending on your data - explicite conversion
Select To_Number('a23') "COL_NAME" From Dual;
-- ORA-01722: invalid number
Select Cast('a23' as Number) "COL_NAME" From Dual;
-- ORA-01722: invalid number
-- depending on your data and/or data manipulation - implicite conversion
Select 'a23' * 3 "COL_NAME" From Dual;
-- ORA-01722: invalid number
-- depending on your db settings of NLS parameters - if decimal point is comma, but dot is used when selecting the data
Select '2.3' * 3 "COL_NAME" From Dual;
-- ORA-01722: invalid number
Dealing with DATE datatype and date manipulations or date math could also couse this error
FAULTY VIEW SAMPLE 1 ( COL1 data ) :
Create Table TEST_TBL AS
( Select 'A' as A_LETTER, 1 as A_NUMBER, To_Date('01.01.2024', 'dd.mm.yyyy') as A_DATE From Dual Union All
Select 'B', 2, To_Date('02.01.2024', 'dd.mm.yyyy') From Dual Union All
Select 'C', 3, To_Date('03.01.2024', 'dd.mm.yyyy') From Dual
) ;
-- Table TEST_TBL created.
Create View TEST_VIEW AS
Select A_LETTER, A_NUMBER, A_DATE,
A_LETTER + A_NUMBER as COL1 -- expresion when run against the data will raise the error
From TEST_TBL;
-- View TEST_VIEW created.
When you select any column except COL1 there will be no error
Select A_LETTER From TEST_VIEW;
/*
A_LETTER
---------
A
B
C */
Select A_LETTER, A_NUMBER, A_DATE From TEST_VIEW;
/*
A_LETTER A_NUMBER A_DATE
-------- ---------- --------
A 1 01.01.24
B 2 02.01.24
C 3 03.01.24 */
... when COL1 is selected the error is raised ...
Select A_LETTER, A_NUMBER, A_DATE, COL1 From TEST_VIEW;
-- ORA-01722: invalid number
FAULTY VIEW SAMPLE 2 ( Where clause ) :
Create View TEST_VIEW AS
Select A_LETTER, A_NUMBER, A_DATE
From TEST_TBL
Where A_LETTER + A_NUMBER > 1;
-- View TEST_VIEW created.
Selecting any column from such a view will raise the error.
Select A_LETTER From TEST_VIEW;
-- ORA-01722: invalid number
FAULTY VIEW SAMPLE 3 - with data issue ( in column A_LETTER ) :
If the underlaying table data is like below:
Select * From TEST_TBL;
/*
A_LETTER A_NUMBBER A_DATE
-------- --------- ----------
11 1 01.01.24
12 2 02.01.24
13C 3 03.01.24 */
... here A_LETTER column has values that could be implicitely converted to number (except the last row).
If the view is defined like this:
Create View TEST_VIEW AS
Select A_LETTER, A_NUMBER, A_DATE, A_LETTER + A_NUMBER as COL1
From TEST_TBL
Where A_LETTER + A_NUMBER > 1;
-- View TEST_VIEW created.
... and if you try to select all the data - an error will be raised:
Select A_LETTER, A_NUMBER, A_DATE, COL1
From TEST_VIEW
-- ORA-01722: invalid number
... but if you know/suspect that column A_LETTER could have some incorrect data - you can exclude such rows and fetch all the remaining data (using ... Default value On Conversion error when attempting to convert char value into number)
Select A_LETTER, A_NUMBER, A_DATE, COL1
From TEST_VIEW
Where To_Number(A_LETTER Default 0 ON Conversion Error) != 0
/*
A_LETTER A_NUMBER A_DATE COL1
-------- ---------- -------- ----------
11 1 01.01.24 12
12 2 02.01.24 14 */