Search code examples
oracle-databaseamazon-web-servicesaws-glue

Unable to connect oracle database using cx_oracle from AWS Glue


I am trying to connect oracle database from AWS glue using cx_oracle but i am getting this error message

DatabaseError: DPI-1047: Cannot locate a 64-bit Oracle Client library: "libclntsh.so: cannot open shared object file: No such file or directory". See https://cx-oracle.readthedocs.io/en/latest/user_guide/installation.html for help

I tried to downoad the so files as per the document and stored it in S3 which has been linked to Glue with --extra-files parameter but still getting same error message

I have tried this stackoverflow question and also tried to set rpath with s3 url but no success. Any idea would really help


Solution

  • Credits

    This answer is a compilation of this and this and a lot of discussion around the former in the comments. Credit for the rpath patching solution goes to @harjeet-singh, the original author of the abovementioned answer and for the libaio to @good-will, but there are still some steps around these solutions that keep confusing people, so that is why I am going to consolidate everything in a single step-by-step answer here.

    Background

    In order to connect to an Oracle database using cx-Oracle from a Python shell AWS Glue job, we need to bundle the oracle client libraries with it. Furthermore, the libraries have to be patched with a correct rpath in order to load correctly, because in Glue runtime we have filesystem write access only to /tmp, which is where our archive will be, but cx-Oracle can't know that and expects a different directory by default. And the LD_LIBRARY_PATH hack cannot be implemented because we have no control over how the Glue job is launched.

    Step-By-Step Guide

    1. Download Instant Client Basic ZIP package for x86-64 Linux from here. This guide uses version 21.5.0.0.0
    wget https://download.oracle.com/otn_software/linux/instantclient/instantclient-basic-linuxx64.zip
    
    1. Unzip the archive
    unzip instantclient-basic-linuxx64.zip 
    
    1. Remove symlinks from the archive and move the file they point to (in this case libclntsh.so.21.1 to the one that is going to be looked up when you use cx-Oracle: libclntsh.so). This is done because whatever loads these libraries dynamically apparently does not resolve symlinks. Maybe it will in the future, but I had to do this to make it work.
    cd instantclient_21_5/
    find . -type l -name "libclntsh.so*" -delete
    mv libclntsh.so.21.1 libclntsh.so
    

    Do the same with the other files with symlinks if after completing the whole guide and running your job you still have problems with libraries like

    DatabaseError: DPI-1047: Cannot locate a 64-bit Oracle Client library: "libsomething.so: cannot open shared object file: No such file or directory"

    1. Patch the rpath to point to a static directory that we are going to be using from inside the Glue job For example, if your archive is named instant-client-basic-linux.x64-21.5.0.0.0 and it contains a folder named instantclient_21_5 with all the libraries. When the job runs, this archive is going to be available in a random directory under /tmp (more on that below. We need to find our archive in one of those directories and extract it to a static directory under /tmp, for example /tmp/libs. So then, your rpath would be /tmp/libs/instant-client-basic-linux.x64-21.5.0.0.0/instantclient_21_5, because that in the absolute path to client libraries.
    sudo apt-get update
    sudo apt-get install patchelf -y
    patchelf --set-rpath /tmp/libs/instant-client-basic-linux.x64-21.5.0.0.0/instantclient_21_5 libclntsh.so
    
    1. Put libaio.so.1 in the archive
    cd ..
    wget https://src.fedoraproject.org/lookaside/pkgs/libaio/libaio-0.3.110.tar.gz/2a35602e43778383e2f4907a4ca39ab8/libaio-0.3.110.tar.gz
    tar xzvf libaio-0.3.110.tar.gz
    cd libaio-0.3.110
    make prefix=`pwd`/usr install 
    find ./usr/lib/ -type l -name "libclntsh.so*" -delete
    mv ./usr/lib/libaio.so.1.0.1 ../instantclient_21_5/libaio.so.1
    

    Note: you might want to check with a newer version of libaio if there is one. 6. Zip the archive

    cd ..
    zip -T -r instantclient-basic-linuxx64_patched.zip instantclient_21_5/
    
    1. Download cx-Oracle wheel
    wget https://files.pythonhosted.org/packages/a9/b7/c2d0223fb4f1013b090cf82f3ce56f36f33b79a48f9c33b36717c2977b04/cx_Oracle-8.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
    
    1. Upload the archive and cx-Oracle to S3
    aws s3 cp instantclient-basic-linuxx64_patched.zip s3://<mybucket>/glue_libs
    aws s3 cp cx_Oracle-8.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl s3://<mybucket>/glue_libs
    
    1. Configure your Glue job

    I assume your glue job is already created and we are going to just configure it.

    • Put the S3 URL to the archive in "Referenced files path" configuration parameter
    • Put the S3 URL to the cx-Oracle wheel in "Python library path" configuration parameter
    1. Add a bit of code to the glue job to set up the libraries. This code must be executed before any usage of cx-Oracle. It can be executed after import, but before the usage.

    Iterate through random directories in /tmp, find an archive you created and extract it to a static directory that we set in rpath previously. Then initialize the cx-Oracle client and you are good to go. Here is an example of an implementation:

    import zipfile
    
    from pathlib import Path
    
    import cx_Oracle
    
    filename = 'instantclient-basic-linuxx64_patched.zip'
    oracle_archive = next(Path('./tmp').glob(f'**/{filename}'))
    with zipfile.ZipFile(oracle_archive, 'r') as f:
        Path('./tmp/libs').mkdir()
        f.extractall('./tmp/libs')
    
    cx_Oracle.init_oracle_client(lib_dir=f'/tmp/libs/{filename}/instantclient_21_5')
    

    TLDR

    Run this on your Linux machine (substitute your bucket name in the end):

    wget https://download.oracle.com/otn_software/linux/instantclient/instantclient-basic-linuxx64.zip
    unzip instantclient-basic-linuxx64.zip
    cd instantclient_21_5/
    find . -type l -name "libclntsh.so*" -delete
    mv libclntsh.so.21.1 libclntsh.so
    sudo apt-get update
    sudo apt-get install patchelf -y
    patchelf --set-rpath /tmp/libs/instant-client-basic-linux.x64-21.5.0.0.0/instantclient_21_5 instantclient_21_5/libclntsh.so
    cd ..
    wget https://src.fedoraproject.org/lookaside/pkgs/libaio/libaio-0.3.110.tar.gz/2a35602e43778383e2f4907a4ca39ab8/libaio-0.3.110.tar.gz
    tar xzvf libaio-0.3.110.tar.gz
    cd libaio-0.3.110
    make prefix=`pwd`/usr install 
    find ./usr/lib/ -type l -name "libclntsh.so*" -delete
    mv ./usr/lib/libaio.so.1.0.1 ../instantclient_21_5/libaio.so.1
    cd ..
    zip -T -r instantclient-basic-linuxx64_patched.zip instantclient_21_5/
    wget https://files.pythonhosted.org/packages/a9/b7/c2d0223fb4f1013b090cf82f3ce56f36f33b79a48f9c33b36717c2977b04/cx_Oracle-8.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl
    aws s3 cp instantclient-basic-linuxx64_patched.zip s3://<mybucket>/glue_libs/
    aws s3 cp cx_Oracle-8.3.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl s3://<mybucket>/glue_libs/
    

    Follow steps 9 and 10 above to configure your job.

    Hope anyone who reads this will get this right on the first try, because I sure didn't.