Search code examples
python-3.xoraclesqlalchemysqldatatypes

Convert Oracle Datatypes to SQLAlchemy Types


I have a list of oracle datatypes (hundreds of them), which I would like to programmatically map to SQLAlchemy Datatypes.

Some example oracle datatypes (among many others) are:

XMLTYPE
VARCHAR2
UROWID
URITYPE
UNDEFINED
TIMESTAMP(9)

I am querying data from oracle into a pandas dataframe, and then using sql alchemy to load that data to another system. However I need to specify what the target datatypes should be in the target system. (Just using the dataframe defaults doesn't work due to the ETL process being batched; the first batch of values can "mislead" the datatype.)

Here is a list of SQLAlchemy datatypes, and I need to map the oracle datatypes to the sqlalchemy ones. Is there an elegant way to do so other than manually constructing such a dictionary?

EDIT: The specific example datatypes I gave were not important. I would rather programmatically and dynamically assign an arbitrary column from Oracle to the correct SQLAlchemy datatype. The datatypes are the response to

SELECT
    DISTINCT data_type
FROM
    all_tab_columns;

which returned upwards of 150 values. So I'd really prefer to avoid manually looking up documentation on each.


Solution

    • The XMLTYPE data type has getStringVal and getClobVal methods (depending on length of the content) you can use to extract the XML content which would just be a string value and could use sqlalchemy.types.Text (after calling the getClobVal method).

    • VARCHAR2 is just a variable-length string. Oracle does not have a VARCHAR data type, this is its equivalent (and VARCHAR is currently a synonym of VARCHAR2). The equivalent is sqlalchemy.types.String.

    • UROWID is documented to be:

      Each row in a database has an address. However, the rows of some tables have addresses that are not physical or permanent or were not generated by Oracle Database. For example, the row addresses of index-organized tables are stored in index leaves, which can move. Rowids of foreign tables (such as DB2 tables accessed through a gateway) are not standard Oracle rowids.

      Oracle uses universal rowids (urowids) to store the addresses of index-organized and foreign tables. Index-organized tables have logical urowids and foreign tables have foreign urowids. Both types of urowid are stored in the ROWID pseudocolumn (as are the physical rowids of heap-organized tables).

      It makes no sense to persist a UROWID outside the database as the data it represents may not be permanent and may move and, even if it does not move, it just represents a memory-location and not any actual data.

    • URITYPE represents a URI to another resource and you can just use its getURL method to get the URI it represents, which should just be a string value. The equivalent data-type would be sqlalchemy.types.String or sqlalchemy.types.Text (after calling the getURL method to get the string representation).

    • UNDEFINED is not a data type. You should check if someone has created a user-defined type called UNDEFINED.

    • TIMESTAMP(9) is just a TIMESTAMP with a precision of 9 decimal seconds digits. From the documentation, it appears you want sqlalchemy.dialects.oracle.DATE.


    The datatypes are the response to

    SELECT DISTINCT data_type FROM all_tab_columns;
    

    Please do not rely on this list as this includes all the data types used in the tables of the *SYS schemas and you should not touch most of these tables as changing them can have unforeseen consequences (including making your database unusable).

    Most of those data types are going to be private data types used within the internal workings of the database; you can determine those as they commonly have a $ in the type name. The next most common grouping is going to be the SDO geometry data types and these all have the SDO_ prefix.

    So if your query then becomes:

    SELECT CASE
           WHEN data_type LIKE '%$%' THEN 'Private Type'
           WHEN data_type LIKE 'SDO_%' THEN 'SDO Geometry Type'
           ELSE data_type
           END AS data_type,
           COUNT( DISTINCT data_type ) AS num_instances
    FROM   all_tab_columns
    GROUP BY
           CASE
           WHEN data_type LIKE '%$%' THEN 'Private Type'
           WHEN data_type LIKE 'SDO_%' THEN 'SDO Geometry Type'
           ELSE data_type
           END
    ORDER BY num_instances DESC, data_type ASC;
    

    Then, on a db<>fiddle you get this output, with zero user-generated tables (so only for the system generated tables):

    DATA_TYPE                    | NUM_INSTANCES
    :--------------------------- | ------------:
    Private Type                 |           130
    SDO Geometry Type            |             5
    ANYDATA                      |             1
    BINARY_DOUBLE                |             1
    BLOB                         |             1
    CHAR                         |             1
    CLOB                         |             1
    DATE                         |             1
    DS_VARRAY_4_CLOB             |             1
    FLOAT                        |             1
    HSBLKNAMLST                  |             1
    HSBLKVALARY                  |             1
    INTERVAL DAY(3) TO SECOND(0) |             1
    INTERVAL DAY(3) TO SECOND(2) |             1
    INTERVAL DAY(9) TO SECOND(6) |             1
    LONG                         |             1
    LONG RAW                     |             1
    NUMBER                       |             1
    NVARCHAR2                    |             1
    RAW                          |             1
    ROWID                        |             1
    TIMESTAMP(0)                 |             1
    TIMESTAMP(3)                 |             1
    TIMESTAMP(6)                 |             1
    TIMESTAMP(6) WITH TIME ZONE  |             1
    TIMESTAMP(9)                 |             1
    UNDEFINED                    |             1
    VARCHAR2                     |             1
    XMLTYPE                      |             1
    

    And 130 of the 162 data types are private and another 5 are SDO geometry types.

    Investigating the other "interesting types":

    SELECT owner, table_name, column_name, data_type
    FROM   ALL_TAB_COLUMNS
    WHERE  data_type IN ( 'UNDEFINED', 'HSBLKVALARY', 'HSBLKNAMLST', 'ROWID', 'ANYDATA', 'DS_VARRAY_4_CLOB' )
    ORDER BY owner, table_name, data_type
    

    Outputs:

    OWNER  | TABLE_NAME                    | COLUMN_NAME           | DATA_TYPE       
    :----- | :---------------------------- | :-------------------- | :---------------
    CTXSYS | CTX_USER_PENDING              | PND_ROWID             | ROWID           
    CTXSYS | DRV$PENDING                   | PND_ROWID             | ROWID           
    CTXSYS | DRV$UNINDEXED                 | UNX_ROWID             | ROWID           
    CTXSYS | DRV$UNINDEXED2                | UNX_ROWID             | ROWID           
    CTXSYS | DRV$WAITING                   | WTG_ROWID             | ROWID           
    MDSYS  | SDO_GR_MOSAIC_0               | RID                   | ROWID           
    MDSYS  | SDO_GR_MOSAIC_1               | RID                   | ROWID           
    MDSYS  | SDO_GR_MOSAIC_2               | RID                   | ROWID           
    SYS    | ALL_SCHEDULER_JOB_ARGS        | ANYDATA_VALUE         | ANYDATA         
    SYS    | ALL_SCHEDULER_PROGRAM_ARGS    | DEFAULT_ANYDATA_VALUE | ANYDATA         
    SYS    | ALL_SQLSET_BINDS              | VALUE                 | ANYDATA         
    SYS    | ALL_STREAMS_MESSAGE_CONSUMERS | NOTIFICATION_CONTEXT  | ANYDATA         
    SYS    | ALL_SUMDELTA                  | HIGHROWID             | ROWID           
    SYS    | ALL_SUMDELTA                  | LOWROWID              | ROWID           
    SYS    | HS$_PARALLEL_METADATA         | PARTITION_COL_TYPES   | HSBLKNAMLST     
    SYS    | HS$_PARALLEL_METADATA         | PARTITION_COL_NAMES   | HSBLKNAMLST     
    SYS    | HS_PARALLEL_METADATA          | PARTITION_COL_NAMES   | HSBLKNAMLST     
    SYS    | HS_PARALLEL_METADATA          | PARTITION_COL_TYPES   | HSBLKNAMLST     
    SYS    | HS_PARALLEL_PARTITION_DATA    | PARTITION_COL_TYPES   | HSBLKNAMLST     
    SYS    | HS_PARALLEL_PARTITION_DATA    | PARTITION_COL_NAMES   | HSBLKNAMLST     
    SYS    | HS_PARALLEL_PARTITION_DATA    | HIGH_VALUE            | HSBLKVALARY     
    SYS    | HS_PARALLEL_PARTITION_DATA    | LOW_VALUE             | HSBLKVALARY     
    SYS    | ORA_KGLR7_IDL_SB4             | PIECE                 | UNDEFINED       
    SYS    | ORA_KGLR7_IDL_UB2             | PIECE                 | UNDEFINED       
    SYS    | USER_COMPARISON_ROW_DIF       | LOCAL_ROWID           | ROWID           
    SYS    | USER_COMPARISON_ROW_DIF       | REMOTE_ROWID          | ROWID           
    SYS    | USER_PARALLEL_EXECUTE_CHUNKS  | END_ROWID             | ROWID           
    SYS    | USER_PARALLEL_EXECUTE_CHUNKS  | START_ROWID           | ROWID           
    SYS    | USER_SCHEDULER_JOB_ARGS       | ANYDATA_VALUE         | ANYDATA         
    SYS    | USER_SCHEDULER_PROGRAM_ARGS   | DEFAULT_ANYDATA_VALUE | ANYDATA         
    SYS    | USER_SQLSET_BINDS             | VALUE                 | ANYDATA         
    SYS    | USER_SQLTUNE_BINDS            | VALUE                 | ANYDATA         
    SYS    | USER_SR_STLOG_EXCEPTIONS      | BAD_ROWID             | ROWID           
    SYS    | USER_SUBSCR_REGISTRATIONS     | ANY_CONTEXT           | ANYDATA         
    SYS    | _USER_COMPARISON_ROW_DIF      | RMT_ROWID             | ROWID           
    SYS    | _USER_COMPARISON_ROW_DIF      | LOC_ROWID             | ROWID           
    SYS    | _user_stat_varray             | CL1                   | DS_VARRAY_4_CLOB
    XDB    | XDB$ROOT_INFO_V               | RESOURCE_ROOT         | ROWID           
    

    These are all *SYS tables or private tables and you are almost certainly never going to want to interact with them directly.

    The UNDEFINED type is interesting as it does appear to be undefined:

    SELECT owner, type_name
    FROM   ALL_TYPES
    WHERE  TYPE_NAME = 'UNDEFINED';
    

    Returns zero rows and:

    CREATE TABLE TABLE_NAME ( id UNDEFINED );
    

    Raises the exception ORA-00902: invalid datatype.


    What would be a better determination of the types you are using? Just look in the schemas you have created:

    SELECT DISTINCT owner, data_type
    FROM   all_tab_columns
    WHERE  owner IN ( 'USER1', 'USER2', 'USER3' )
    

    Then you know that those data types are the ones in use within the tables your users have created. You should find that, unless you are doing something esoteric, most (if not all) of the types you are using are going to be handled natively by SQLAlchemy.

    db<>fiddle here