Search code examples
sql-serverhadoopjdbcsqoophue

JDBC Driver Class for MS SQL Server is not found in Hue


We installed a clustered Hadoop server andwe use Hue as our interface, our goal is sqooping data from MS SQL Server to Hadoop. We found a tutorial here

However I get the follwowing error in Hue

enter image description here


Solution

  • I found the solution with the help of http://capnjosh.com/

    1. If you’re using the Sqoop stuff in the web-based interface, you’re actually using Sqoop2

    2. You have to download and install the JDBC driver for SQL Server yourself – curl -L ‘http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz’ | tar xz – sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/ – while you’re at it, you may as well put it in the sqoop directory too: sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/

    3. Sqoop2 home directory is /var/lib/sqoop2/

    4. restart Sqoop2 service after copying in the JDBC driver file

    5a. “Connector” is a Sqoop thing for how it communicates with various processes in Hadoop. Unless you’ve got a lot more experience, it should just be “generic-jdbc-connector”

    5b. Class Path is “com.microsoft.sqlserver.jdbc.SQLServerDriver”

    1. connection string in “Manage Connections” is like this: jdbc:sqlserver://192.168.1.102:1433 (though port number defaults to 1433)

    For the Actions of the job:

    1. Schema name: I just leave this blank and instead paste in the TSQL query I want – if you specify a TSQL statement below, then this needs to be blank

    2. Table name: I leave this blank and instead do it all in the TSQL. – if you specify a TSQL statement below, then this needs to be blank

    3. Table SQL statement: Paste in your query (you can craft it in SSMS and paste it in here). Then, append this to the end of it: +and+${CONDITIONS}. ${CONDITIONS} expands out to be some range of values of the Partition Column name you can specify below this field.

    4. Table Column names: put them in if you want to limit the columns that actually get extracted.

    5. Partition column name: Make sure this column is indexed somehow – Sqoop first queries the min and max values then issues a series of queries that return evenly-distributed portions of all rows based on this column value. e.g. a transactions table; I specify the transaction date column in Partition column name; sqoop gets the min and max dates; Sqoop then issues a series of queries replacing ${CONDITIONS} with “where transDate >= ‘2015-01-01’ and transDate < '2015-04-01'" (moving that window for each query). Each query can be sent from any node in your cluster (though I bet you can restrict which nodes those are)12. Nulls in partition Column: if you do have nulls, this helps Sqoop.13. You can manually specify the query Sqoop uses to get the min/max of the partition column name (by default it looks like this "select min(), max() from ()”.

    6. if you mess with the connection you create in Hue/Sqoop2, note you have to type in the password again

    7. if you get errors, don’t fight it – you have to log in via SSH and look at /var/log/sqoop2/sqoop2.log

    8. if your jobs are failing, and in SQL Server Profiler on the SQL Server you’re querying you only see queries with “where … (1 = 0)…” in them, check your firewall rules: all the nodes in the cluster need to be able to talk out to the SQL Server instance. Yeah, Sqoop will distribute the various partitioned queries across your cluster :)