We installed a clustered Hadoop server and we use Hue as our interface; our goal is sqooping data from MS SQL Server to Hadoop. We found a tutorial here. However, I got the following error in Hue.
I found the solution with the help of http://capnjosh.com/
If you’re using the Sqoop stuff in the web-based interface, you’re actually using Sqoop2
You have to download and install the JDBC driver for SQL Server yourself:
curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/
While you're at it, you may as well put it in the sqoop directory too:
sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop/
The Sqoop2 home directory is /var/lib/sqoop2/.
Restart the Sqoop2 service after copying in the JDBC driver file.
5a. “Connector” is a Sqoop thing for how it communicates with various processes in Hadoop. Unless you’ve got a lot more experience, it should just be “generic-jdbc-connector”
5b. Class Path is “com.microsoft.sqlserver.jdbc.SQLServerDriver”
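For reference, the connection's JDBC URL for Microsoft's driver follows the standard jdbc:sqlserver:// format; the server name and database below are placeholders, not values from the original setup:

```python
# Sketch of a JDBC URL for the generic-jdbc-connector connection.
# Host, port, and database name are placeholders -- substitute your own.
# 1433 is SQL Server's default TCP port.
url = "jdbc:sqlserver://sqlserver.example.com:1433;databaseName=MyDatabase"
print(url)
```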
For the Actions of the job:
Schema name: I just leave this blank and instead paste in the TSQL query I want. If you specify a TSQL statement below, this needs to be blank.
Table name: I leave this blank and instead do it all in the TSQL. Again, if you specify a TSQL statement below, this needs to be blank.
Table SQL statement: Paste in your query (you can craft it in SSMS and paste it in here). Then append this to the end of it: +and+${CONDITIONS}. ${CONDITIONS} expands out to some range of values of the Partition column name you can specify below this field.
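For example, here is a hypothetical Table SQL statement for a transactions table (the table and column names are made up); the ${CONDITIONS} token is appended at the end as the step above describes, assuming the + signs there stand for plain spaces:

```python
# Hypothetical free-form query for the "Table SQL statement" field.
# dbo.transactions and its columns are invented for illustration.
# ${CONDITIONS} is left literal -- Sqoop2 substitutes it at run time.
query = (
    "select transId, transDate, amount "
    "from dbo.transactions "
    "where amount > 0 "
    "and ${CONDITIONS}"
)
print(query)
```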
Table Column names: put them in if you want to limit the columns that actually get extracted.
Partition column name: Make sure this column is indexed somehow. Sqoop first queries the min and max values of the column, then issues a series of queries that return evenly-distributed portions of all rows based on this column value. E.g., for a transactions table: I specify the transaction date column in Partition column name; Sqoop gets the min and max dates; Sqoop then issues a series of queries replacing ${CONDITIONS} with "where transDate >= '2015-01-01' and transDate < '2015-04-01'" (moving that window for each query). Each query can be sent from any node in your cluster (though I bet you can restrict which nodes those are).
12. Nulls in partition column: if you do have nulls, this helps Sqoop.
13. You can manually specify the query Sqoop uses to get the min/max of the partition column name (by default it looks like this: "select min(), max() from ()").
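The windowing behavior described above can be sketched roughly as follows; this is an approximation of the idea, not Sqoop's actual split logic (which also handles nulls, the max boundary, and non-date types):

```python
from datetime import date, timedelta

def partition_conditions(col, lo, hi, n):
    """Split the half-open date range [lo, hi) into n roughly equal
    windows and return the WHERE clause each one substitutes for
    ${CONDITIONS}. Rough sketch of Sqoop's partitioning idea."""
    total = (hi - lo).days
    bounds = [lo + timedelta(days=total * i // n) for i in range(n + 1)]
    return [
        f"where {col} >= '{a}' and {col} < '{b}'"
        for a, b in zip(bounds, bounds[1:])
    ]

# Four windows over one year of the hypothetical transDate column:
for clause in partition_conditions("transDate", date(2015, 1, 1), date(2016, 1, 1), 4):
    print(clause)
```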
If you mess with the connection you created in Hue/Sqoop2, note that you have to type in the password again.
If you get errors, don't fight it: log in via SSH and look at /var/log/sqoop2/sqoop2.log.
If your jobs are failing, and in SQL Server Profiler on the SQL Server you're querying you only see queries with "where … (1 = 0) …" in them, check your firewall rules: all the nodes in the cluster need to be able to talk out to the SQL Server instance. Yeah, Sqoop will distribute the various partitioned queries across your cluster :)
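A quick way to check whether a given node can actually reach the SQL Server port is a plain TCP connect. This is a sketch to run from each cluster node; the hostname is a placeholder, and 1433 is SQL Server's default port:

```python
import socket

def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds,
    False on refusal, timeout, or DNS failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example (hostname is a placeholder for your SQL Server instance):
# can_reach("sqlserver.example.com", 1433)
```

If this returns False on some nodes but True on others, the firewall rules are the first place to look.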