I wanted to import data from an Oracle database into our Hadoop HDFS and considered using Sqoop. When I tried, I discovered that the Data Connector for Oracle and Hadoop was disabled:
2019-07-18 09:19:58,203 [uber-SubtaskRunner] INFO org.apache.sqoop.manager.oracle.OraOopManagerFactory - Data Connector for Oracle and Hadoop is disabled.
I contacted the sysadmin, who let me know that our Sqoop installation is not configured for Oracle databases and won't be. Instead, they recommended the PySpark script below.
I ran the script on our CDSW instance and, with some configuration on my end, it worked really well.
import os
import netrc
from pyspark.sql import SparkSession
# Set up the PySpark dependencies: to connect to the Oracle DB
# via JDBC we need the JDBC driver jar provided by Oracle
ORACLE_JAR = "ojdbc7.jar"
JAR_LOC = os.path.join(os.environ["JARS_DIR"], ORACLE_JAR)
# Create a SparkSession with the Oracle JDBC jar on the classpath
spark = SparkSession.builder \
    .appName("My Sample App") \
    .config("spark.jars", "local://" + JAR_LOC) \
    .getOrCreate()
# Set the Exadata host to which you are connecting
db_host = "exadata.host.address"
# Read the credentials (user, account, password) from ~/.netrc
user, account, password = netrc.netrc().authenticators(db_host)
# Check the Spark version and other config information
spark.sparkContext.getConf().getAll()
driver = "oracle.jdbc.OracleDriver"
# The service name / SID of your database
sid = "mydb.sid.tns"
url = "/".join(["jdbc:oracle:thin:@/", db_host +":1521", sid])
# The query to run, wrapped in parentheses so Spark can treat it as a table
dbtable = "(select * from table)"
jdbc_df = spark.read.format("jdbc").option("url", url) \
    .option("driver", driver) \
    .option("dbtable", dbtable) \
    .option("user", user) \
    .option("password", password).load()
My question is: what are the relative advantages/disadvantages of using this PySpark script over Sqoop or Flume?
Reading data from relational databases with Spark JDBC is generally much faster than Sqoop, provided the read is partitioned properly. Here are a few of the benefits:
You can combine the read, transform, and write steps into one script/program, instead of reading the data with Sqoop in one job and then doing the transformation and write in another.
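For example, a single job can read from Oracle, transform, and write to HDFS in one go. This is a rough sketch reusing the connection variables from the question; the status filter column, the derived load_date column, and the output path are made up.
from pyspark.sql import functions as F

# Read from Oracle using the same JDBC options as in the question
df = spark.read.format("jdbc").option("url", url) \
    .option("driver", driver) \
    .option("dbtable", dbtable) \
    .option("user", user) \
    .option("password", password).load()

# Transform in the same job: filter rows and stamp a load date
transformed = df.filter(F.col("status") == "ACTIVE") \
    .withColumn("load_date", F.current_date())

# Write straight to HDFS; no separate Sqoop import step needed
transformed.write.mode("overwrite").parquet("/data/warehouse/my_table")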
You can define a new split column on the fly (using a function like ORA_HASH) if you want the data to be partitioned evenly.
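For instance (a sketch only; the table name, the id column, and the 16-bucket count are assumptions, and the connection variables come from the question's script), you can add an ORA_HASH bucket inside the dbtable subquery and use it as the partition column:
# Add a synthetic, evenly distributed split column with ORA_HASH (buckets 0-15)
dbtable = "(select t.*, ora_hash(t.id, 15) as hash_bucket from my_schema.my_table t)"

df = spark.read.format("jdbc").option("url", url) \
    .option("driver", driver) \
    .option("dbtable", dbtable) \
    .option("user", user) \
    .option("password", password) \
    .option("partitionColumn", "hash_bucket") \
    .option("lowerBound", "0") \
    .option("upperBound", "16") \
    .option("numPartitions", "16") \
    .load()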
You can control the number of connections to the database. Increasing the number of connections (via numPartitions) will generally speed up the import, as long as the database can handle the extra load.
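As a rough example (assuming a numeric primary key id and made-up bounds, and reusing url, driver, user, and password from the question), numPartitions caps how many concurrent JDBC connections Spark opens for the read, and fetchsize (the Oracle driver defaults to only 10 rows per round trip) is also worth raising:
# 8 parallel connections, each pulling 10,000 rows per fetch
df = spark.read.format("jdbc").option("url", url) \
    .option("driver", driver) \
    .option("dbtable", "my_schema.my_table") \
    .option("user", user) \
    .option("password", password) \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "1000000") \
    .option("numPartitions", "8") \
    .option("fetchsize", "10000") \
    .load()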