
SQOOP export HDFS to MYSQL db


I'm trying to export a file from HDFS to a MySQL database. I found various solutions, but none of them worked; I even tried removing the WINDOWS-1251 characters from the file.

For context, I'm using VirtualBox with a Hortonworks image for these operations.

My Hive table in the default database:

CREATE EXTERNAL TABLE `airqualitydata`(
  `sensor_id` VARCHAR(100),
  `sensor_type` VARCHAR(100), 
  `location` VARCHAR(100), 
  `lat` VARCHAR(100), 
  `lon` VARCHAR(100), 
  `timestamp` timestamp, 
  `p1` VARCHAR(100), 
  `durp1` VARCHAR(100), 
  `ratiop1` VARCHAR(100), 
  `p2` VARCHAR(100), 
  `durp2` VARCHAR(100), 
  `ratiop2` VARCHAR(100))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/hadoop/airqualitydata'
TBLPROPERTIES ("skip.header.line.count"="1");

The file stored in HDFS under /hadoop/airqualitydata is shown below (I removed the win1251 characters just to be sure).

Note that this data can be viewed in Hive by running SELECT * FROM airqualitydata.

sensor_id;sensor_type;location;lat;lon;timestamp;P1;durP1;ratioP1;P2;durP2;ratioP2
9710;SDS011;4894;43.226;27.934;2021-09-09T00:00:12;70;;;20;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:02:41;83;;;0.93;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:05:14;0.80;;;0.73;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:07:42;0.50;;;0.50;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:10:10;57;;;0.80;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:12:39;0.40;;;0.40;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:15:07;0.70;;;0.70;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:17:35;2;;;0.47;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:20:04;90;;;0.63;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:22:34;0.57;;;0.57;;
9710;SDS011;4894;43.226;27.934;2021-09-09T00:25:01;0.73;;;0.60;;

MYSQL DB & TABLE:

CREATE DATABASE airquality CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci;
CREATE TABLE `airqualitydata`(
  `sensor_id` VARCHAR(100), 
  `sensor_type` VARCHAR(100), 
  `location` VARCHAR(100), 
  `lat` VARCHAR(100), 
  `lon` VARCHAR(100), 
  `timestamp` timestamp, 
  `p1` VARCHAR(100), 
  `durp1` VARCHAR(100), 
  `ratiop1` VARCHAR(100), 
  `p2` VARCHAR(100), 
  `durp2` VARCHAR(100), 
  `ratiop2` VARCHAR(100)
);

SQOOP CLI call:

sqoop export --connect "jdbc:mysql://localhost:3306/airquality?useUnicode=true&characterEncoding=WINDOWS-1251" --username root --password hortonworks1 --export-dir hdfs://sandbox-hdp.hortonworks.com:8020/hadoop/airqualitydata --table airqualitydata --input-fields-terminated-by "\073" --input-lines-terminated-by "\n" -m 1

I also tried removing ?useUnicode=true&characterEncoding=WINDOWS-1251, with no success. I cannot access the log at the URL given in the terminal either, so this is all I have for the failure:

Warning: /usr/hdp/2.6.5.0-292/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
21/09/12 04:04:40 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.6.5.0-292
21/09/12 04:04:40 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
21/09/12 04:04:40 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
21/09/12 04:04:40 INFO tool.CodeGenTool: Beginning code generation
Sun Sep 12 04:04:40 UTC 2021 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
21/09/12 04:04:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `airqualitydata` AS t LIMIT 1
21/09/12 04:04:40 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `airqualitydata` AS t LIMIT 1
21/09/12 04:04:40 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.6.5.0-292/hadoop-mapreduce
Note: /tmp/sqoop-raj_ops/compile/41fba9933b913b974b70403656a13287/airqualitydata.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
21/09/12 04:04:42 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-raj_ops/compile/41fba9933b913b974b70403656a13287/airqualitydata.jar
21/09/12 04:04:42 INFO mapreduce.ExportJobBase: Beginning export of airqualitydata
21/09/12 04:04:43 INFO client.RMProxy: Connecting to ResourceManager at sandbox-hdp.hortonworks.com/172.18.0.2:8032
21/09/12 04:04:43 INFO client.AHSProxy: Connecting to Application History server at sandbox-hdp.hortonworks.com/172.18.0.2:10200
21/09/12 04:04:50 INFO input.FileInputFormat: Total input paths to process : 1
21/09/12 04:04:50 INFO input.FileInputFormat: Total input paths to process : 1
21/09/12 04:04:50 INFO mapreduce.JobSubmitter: number of splits:1
21/09/12 04:04:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1631399426919_0028
21/09/12 04:04:51 INFO impl.YarnClientImpl: Submitted application application_1631399426919_0028
21/09/12 04:04:51 INFO mapreduce.Job: The url to track the job: http://sandbox-hdp.hortonworks.com:8088/proxy/application_1631399426919_0028/
21/09/12 04:04:51 INFO mapreduce.Job: Running job: job_1631399426919_0028
21/09/12 04:04:59 INFO mapreduce.Job: Job job_1631399426919_0028 running in uber mode : false
21/09/12 04:04:59 INFO mapreduce.Job:  map 0% reduce 0%
21/09/12 04:05:03 INFO mapreduce.Job:  map 100% reduce 0%
21/09/12 04:05:04 INFO mapreduce.Job: Job job_1631399426919_0028 failed with state FAILED due to: Task failed task_1631399426919_0028_m_000000
Job failed as tasks failed. failedMaps:1 failedReduces:0

21/09/12 04:05:04 INFO mapreduce.Job: Counters: 8
        Job Counters
                Failed map tasks=1
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=2840
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=2840
                Total vcore-milliseconds taken by all map tasks=2840
                Total megabyte-milliseconds taken by all map tasks=710000
21/09/12 04:05:04 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
21/09/12 04:05:04 INFO mapreduce.ExportJobBase: Transferred 0 bytes in 21.2627 seconds (0 bytes/sec)
21/09/12 04:05:04 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
21/09/12 04:05:04 INFO mapreduce.ExportJobBase: Exported 0 records.
21/09/12 04:05:04 ERROR mapreduce.ExportJobBase: Export job failed!
21/09/12 04:05:04 ERROR tool.ExportTool: Error during export: Export job failed!
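
When the tracking URL is unreachable, the failed task's log can usually be pulled from the command line instead. The following is a minimal sketch, assuming YARN log aggregation is enabled on the sandbox; it extracts the application ID from the "Submitted application" line of the output above and prints the matching yarn logs command. The variable names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: pull the application ID out of the Sqoop console output and build
# the "yarn logs" command for it. The log line is copied from the run above.
log_line='21/09/12 04:04:51 INFO impl.YarnClientImpl: Submitted application application_1631399426919_0028'

# grep -o prints only the part of the line that matches the pattern.
app_id=$(printf '%s\n' "$log_line" | grep -o 'application_[0-9]*_[0-9]*')

# Print the command; running it on the sandbox should show the map task's stack trace.
echo "yarn logs -applicationId $app_id"
```

The real cause of a failed export is normally in the map task's log, not in the client output, so this is the first place to look.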

Any directions will be helpful, Thanks!

EDIT #1: As per the comments above, using:

sqoop export --connect jdbc:mysql://localhost:3306/airquality  --table airqualitydata  --username root --password hortonworks1 --hcatalog-database default --hcatalog-table airqualitydata --verbose

or basically (for people reproducing)

sqoop export --connect jdbc:mysql://<host:port>/<mysql db> --table <mysql table> --username <mysql_user> --password <mysqlpass> --hcatalog-database <hive_db> --hcatalog-table <hive_table> --verbose

This got the data into MySQL. However, it inserts the header row as well. Also, when I run it twice, the data ends up in the table twice (I believed it would overwrite the existing data).

+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+
| sensor_id | sensor_type | location | lat    | lon    | timestamp           | p1   | durp1 | ratiop1 | p2   | durp2 | ratiop2 |
+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+
| sensor_id | sensor_type | location | lat    | lon    | 2021-09-12 05:55:49 | P1   | durP1 | ratioP1 | P2   | durP2 | ratioP2 |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 70   |       |         | 20   |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 83   |       |         | 0.93 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.80 |       |         | 0.73 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.50 |       |         | 0.50 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 57   |       |         | 0.80 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.40 |       |         | 0.40 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.70 |       |         | 0.70 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 2    |       |         | 0.47 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 90   |       |         | 0.63 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.57 |       |         | 0.57 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:55:49 | 0.73 |       |         | 0.60 |       |         |
| sensor_id | sensor_type | location | lat    | lon    | 2021-09-12 05:58:02 | P1   | durP1 | ratioP1 | P2   | durP2 | ratioP2 |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 70   |       |         | 20   |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 83   |       |         | 0.93 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.80 |       |         | 0.73 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.50 |       |         | 0.50 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 57   |       |         | 0.80 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.40 |       |         | 0.40 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.70 |       |         | 0.70 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 2    |       |         | 0.47 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 90   |       |         | 0.63 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.57 |       |         | 0.57 |       |         |
| 9710      | SDS011      | 4894     | 43.226 | 27.934 | 2021-09-12 05:58:02 | 0.73 |       |         | 0.60 |       |         |
+-----------+-------------+----------+--------+--------+---------------------+------+-------+---------+------+-------+---------+

The data in Hive is okay (no header row there). What might cause this?

Also, I got an exception, but the job completed overall. Is this important?

21/09/12 05:57:41 INFO mapreduce.Job: Running job: job_1631399426919_0035
21/09/12 05:57:55 INFO mapreduce.Job: Job job_1631399426919_0035 running in uber mode : false
21/09/12 05:57:55 INFO mapreduce.Job:  map 0% reduce 0%
21/09/12 05:58:03 INFO mapreduce.Job:  map 100% reduce 0%
21/09/12 05:58:05 INFO mapreduce.Job: Job job_1631399426919_0035 completed successfully
21/09/12 05:58:06 INFO mapreduce.Job: Counters: 30
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=345759
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=2597
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=4966
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=4966
                Total vcore-milliseconds taken by all map tasks=4966
                Total megabyte-milliseconds taken by all map tasks=1241500
        Map-Reduce Framework
                Map input records=12
                Map output records=12
                Input split bytes=1800
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=211
                CPU time spent (ms)=3490
                Physical memory (bytes) snapshot=217477120
                Virtual memory (bytes) snapshot=1972985856
                Total committed heap usage (bytes)=51380224
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Transferred 2.5361 KB in 62.3328 seconds (41.6635 bytes/sec)
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Exported 12 records.
21/09/12 05:58:06 INFO mapreduce.ExportJobBase: Publishing HCatalog export job data to Listeners
21/09/12 05:58:06 WARN mapreduce.PublishJobData: Unable to publish export data to publisher org.apache.atlas.sqoop.hook.SqoopHook
java.lang.ClassNotFoundException: org.apache.atlas.sqoop.hook.SqoopHook
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:264)
        at org.apache.sqoop.mapreduce.PublishJobData.publishJobData(PublishJobData.java:46)
        at org.apache.sqoop.mapreduce.ExportJobBase.runExport(ExportJobBase.java:457)
        at org.apache.sqoop.manager.SqlManager.exportTable(SqlManager.java:931)
        at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:81)
        at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:147)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:183)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:225)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:234)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:243)
21/09/12 05:58:06 DEBUG util.ClassLoaderStack: Restoring classloader: sun.misc.Launcher$AppClassLoader@4232c52b

Solution

  • Solution to your first problem: use --hcatalog-database mydb --hcatalog-table airquality (with your own database and table names) and remove the --export-dir parameter.

    Sqoop export cannot replace existing data. Run a sqoop eval statement to truncate the target table before loading it.

    sqoop eval --connect conn_parameters --username xx --password yy --query "truncate table mytab;"

    You can also use Sqoop export's update mode (--update-key) to update existing rows instead; see https://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html

    As for the header issue, I think the source data has a header row. The file shown in the question starts with a header line, and the job counters report 12 input records for 11 data rows, so the header was read as data. The Hive table only hides that line through skip.header.line.count, so check whether the source table definition is being honored on the export path.
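
The steps above can be sketched as one repeatable script. This is only an illustration under stated assumptions: the connection values are copied from the question, -P (interactive password prompt, as Sqoop's own warning suggests) replaces --password, and DRY_RUN, run, truncate_target, export_table and drop_header_row are hypothetical names introduced here. The cleanup step assumes the header row can be identified by sensor_id = 'sensor_id', as in the MySQL output shown in the question.

```shell
#!/usr/bin/env bash
# Sketch: make the Sqoop export repeatable by truncating the MySQL table first,
# then remove the header line if it was exported as a data row.
CONNECT="jdbc:mysql://localhost:3306/airquality"
DB_USER="root"
TABLE="airqualitydata"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: empty the target table, because sqoop export only appends rows.
truncate_target() {
  run sqoop eval --connect "$CONNECT" --username "$DB_USER" -P \
    --query "TRUNCATE TABLE $TABLE"
}

# Step 2: export the Hive table through HCatalog (no --export-dir).
export_table() {
  run sqoop export --connect "$CONNECT" --username "$DB_USER" -P \
    --table "$TABLE" --hcatalog-database default --hcatalog-table "$TABLE"
}

# Step 3: delete the header line if it ended up in MySQL as a row.
drop_header_row() {
  run sqoop eval --connect "$CONNECT" --username "$DB_USER" -P \
    --query "DELETE FROM $TABLE WHERE sensor_id = 'sensor_id'"
}

truncate_target && export_table && drop_header_row
```

By default the script only prints the three commands so they can be reviewed; setting DRY_RUN=0 runs them, with a password prompt for each step. Removing the header line from the HDFS file itself would make step 3 unnecessary.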