Tags: mysql, sqoop, mysqlimport

Write MapReduce ETL job


I have a table with ~300 GB of data in MySQL, and I want to add a new column to it. When I alter the table with an ALTER statement, it never completes and the process dies. So I was planning to write a Sqoop job to export all the data from the table into HDFS, then create a new database in MySQL with a table that has the additional column, and re-import the data from HDFS.
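
The import step I had in mind is something like this (the target directory, split column and mapper count below are placeholders, not the exact values I used):

sqoop import --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --target-dir /user/hadoop/nodes_dump --fields-terminated-by ',' --split-by id -m 8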

table structure:
CREATE TABLE `nodes` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `type` int(11) NOT NULL,
  `name` varchar(50) NOT NULL,
  `postcode` varchar(4) DEFAULT NULL,
  `updated` datetime DEFAULT NULL,
  `db_updated` datetime DEFAULT CURRENT_TIMESTAMP,
  `useragent` bigint(20) DEFAULT NULL,
  `last_seen` date DEFAULT NULL, -- newly added column
  PRIMARY KEY (`id`),
  UNIQUE KEY `akaid_index` (`type`,`name`),
  KEY `useragent_idx` (`useragent`),
  KEY `type` (`type`),
  CONSTRAINT `useragentfk` FOREIGN KEY (`useragent`) REFERENCES `useragents` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=1091725696 DEFAULT CHARSET=latin1;
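
The ALTER that never completes is essentially this, run directly against MySQL (shown here via the mysql client):

mysql -u root -p graph1 -e "ALTER TABLE nodes ADD COLUMN last_seen date DEFAULT NULL"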

Sqoop command:

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ','

I'm getting the error below:

Error: java.io.IOException: Can't export data, please check failed map task logs
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:112)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:39)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.sqoop.mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:64)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.RuntimeException: Can't parse input data: 'NULL'
    at nodes.__loadFromFields(nodes.java:470)
    at nodes.parse(nodes.java:388)
    at org.apache.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:83)
    ... 10 more
Caused by: java.lang.NumberFormatException: For input string: "NULL"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:441)
    at java.lang.Long.valueOf(Long.java:540)
    at nodes.__loadFromFields(nodes.java:467)
    ... 12 more

The file in HDFS contains the following records:

1289603991,1,fee1cee723bdb0bc499c443765b40e3d,,2016-04-13 10:19:59,2016-04-14 03:44:55,5296252
1289603992,1,edf65c2e7b89388fe9068cc3a898a3fd,,2016-04-13 10:20:00,2016-04-14 03:44:55,5411481
1289603993,1,5760fd1cca92a65ce6f2db43853fc118,,2016-04-13 10:19:59,2016-04-14 03:44:55,4441745
1289603994,1,65dd92c80df5581f55bc60f3e997ec05,,2016-04-13 10:19:59,2016-04-14 03:44:55,5332084
1289603995,1,7654a84428f3064828f5972cfce5f8e6,,2016-04-13 10:20:00,2016-04-14 03:44:55,5202243
1289603996,1,84c270212fe5f3a52cb2bd75403da058,,2016-04-13 10:20:00,2016-04-14 03:44:55,5398729
1289603997,1,a486382c4fc296a5e8d3c0491568c22c,,2016-04-13 10:19:57,2016-04-14 03:44:55,5289170
1289603998,112,2_3Nns7YXPmS_xv3imJBiw04BQf1sNc2tJrtFJ5TCx98,,2016-04-13 10:20:00,2016-04-14 03:44:55,NULL
1289603999,1,a3607df77e025b12c1728f62589857fa,,2016-04-13 10:19:59,2016-04-14 03:44:55,12
1289604000,113,570e1d4e6372cd9c,,2016-04-13 10:19:59,2016-04-14 03:44:55,NULL
1289604001,113,57023dd016258fbf,,2016-04-13 10:20:00,2016-04-14 03:44:55,NULL
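
Since the file contains the literal string NULL rather than an empty field, I added null-handling arguments to the export; the command was something like this (the 'NULL' token matches what is in the file):

sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ',' --input-null-string 'NULL' --input-null-non-string 'NULL'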

After adding --input-null-string and --input-null-non-string the Sqoop job gets past the parse error, but it now fails for the following reason:

2016-06-07 18:11:37,750 ERROR [Thread-9] org.apache.sqoop.mapreduce.AsyncSqlOutputFormat: Got exception in update thread: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry '1289603991' for key 'PRIMARY'
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:404)
    at com.mysql.jdbc.Util.getInstance(Util.java:387)
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:934)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3966)
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3902)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2526)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2673)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192)
    at org.apache.sqoop.mapreduce.AsyncSqlOutputFormat$AsyncSqlExecThread.run(AsyncSqlOutputFormat.java:233)

Since the table has an AUTO_INCREMENT column, I am getting the above error. Is there any other way to alter this table and make it faster? As per this thread (Changing Large MySQL InnoDB Tables), it seems altering large InnoDB tables can take a long time, even days.

Any alternative suggestions would be appreciated.


Solution

  • Your data violate the primary key constraint:

    If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints.

    You need to truncate your table before executing the export command, or update the existing data in place using this argument (see the sketch below):

    --update-key id
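
    For example, a sketch based on your original export command (the 'NULL' tokens and the allowinsert mode are assumptions, adjust them to your data):

    # Option 1: empty the table first, then re-run the plain export
    mysql -u root -p -e "TRUNCATE TABLE graph1.nodes"

    # Option 2: keep the existing rows and update them in place by primary key
    sqoop export --connect jdbc:mysql://localhost:3306/graph1 --table nodes --username root --password password --export-dir <dir-path> --input-fields-terminated-by ',' --input-null-string 'NULL' --input-null-non-string 'NULL' --update-key id --update-mode allowinsert

    With --update-key alone Sqoop only issues UPDATE statements, so rows that exist in HDFS but not in the table are skipped; adding --update-mode allowinsert lets it insert those rows as well.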