amazon-web-services · hadoop · mapreduce · hive · elastic-map-reduce

Hive query throwing exception - Error while compiling statement: FAILED: ArrayIndexOutOfBoundsException null


I just upgraded the Hive version to 2.1.0 for both hive-exec and hive-jdbc.

But because of this, some queries that were previously working fine have started failing.

Exception -

Exception in thread "main" org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ArrayIndexOutOfBoundsException null
    at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:264)
    at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:250)
    at org.apache.hive.jdbc.HiveStatement.runAsyncOnServer(HiveStatement.java:309)
    at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:250)
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesInternal(HiveQueryExecutor.java:234)
    at com.XXX.YYY.executors.HiveQueryExecutor.executeQueriesMetricsEnabled(HiveQueryExecutor.java:184)
    at com.XXX.YYY.executors.HiveQueryExecutor.main(HiveQueryExecutor.java:500)
Caused by: org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: FAILED: ArrayIndexOutOfBoundsException null
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:387)
    at org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:186)
    at org.apache.hive.service.cli.operation.SQLOperation.runInternal(SQLOperation.java:269)
    at org.apache.hive.service.cli.operation.Operation.run(Operation.java:324)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:460)
    at org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:447)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
    at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
    at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
    at com.sun.proxy.$Proxy33.executeStatementAsync(Unknown Source)
    at org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:294)
    at org.apache.hive.service.cli.thrift.ThriftCLIService.ExecuteStatement(ThriftCLIService.java:497)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1437)
    at org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement.getResult(TCLIService.java:1422)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArrayIndexOutOfBoundsException: null

Query that I ran -

INSERT OVERWRITE TABLE base_performance_order_20160916
SELECT 
*
 FROM 
(
select
coalesce(traffic_feed.sku,commerce_feed.sku) AS sku,
concat(coalesce(traffic_feed.feed_date,commerce_feed.feed_date),' ','00:00:00') AS transaction_date,
commerce_feed.units AS gross_units,
commerce_feed.orders AS gross_orders,
commerce_feed.revenue AS gross_revenue,
NULL AS gross_cost,
NULL AS gross_subsidized_cost,
NULL AS gross_shipping_cost,
NULL AS gross_variable_cost,
NULL AS gross_shipping_charges,
traffic_feed.pageViews AS page_views,
traffic_feed.uniqueVisitors AS unique_visits,
0 AS channel_id,
concat(coalesce(traffic_feed.feed_date,commerce_feed.feed_date),' ','00:00:00') AS feed_date,
from_unixtime(unix_timestamp()) AS creation_date
from traffic_feed
full outer join commerce_feed on coalesce(traffic_feed.sku)=commerce_feed.sku AND coalesce(traffic_feed.feed_date)=commerce_feed.feed_date
) tb
WHERE sku is not NULL and transaction_date is not NULL and channel_id is not NULL and feed_date is not NULL and creation_date is not NULL

The query works fine when I run it without setting any Hive variables.

But when I set the Hive configuration properties below -

"set hivevar:hive.mapjoin.smalltable.filesize=2000000000",
                "set hivevar:mapreduce.map.speculative=false",
                "set hivevar:mapreduce.output.fileoutputformat.compress=true",
                "set hivevar:hive.exec.compress.output=true",
                "set hivevar:mapreduce.task.timeout=6000000",
                "set hivevar:hive.optimize.bucketmapjoin.sortedmerge=true",
                "set hivevar:io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec",
                "set hivevar:hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat",
                "set hivevar:hive.auto.convert.sortmerge.join.noconditionaltask=true",
                "set hivevar:FEED_DATE=20160916",
                "set hivevar:hive.optimize.bucketmapjoin=true",
                "set hivevar:hive.exec.compress.intermediate=true",
                "set hivevar:hive.enforce.bucketmapjoin=true",
                "set hivevar:mapred.output.compress=true",
                "set hivevar:mapreduce.map.output.compress=true",
                "set hivevar:hive.auto.convert.sortmerge.join=true",
                "set hivevar:hive.auto.convert.join=false",
                "set hivevar:mapreduce.reduce.speculative=false",
                "set hivevar:PD_KEY=vijay-test-mail@XXXcommerce.pagerduty.com",
                "set hivevar:mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
                "set hive.mapjoin.smalltable.filesize=2000000000",
                "set mapreduce.map.speculative=false",
                "set mapreduce.output.fileoutputformat.compress=true",
                "set hive.exec.compress.output=true",
                "set mapreduce.task.timeout=6000000",
                "set hive.optimize.bucketmapjoin.sortedmerge=true",
                "set io.compression.codecs=org.apache.hadoop.io.compress.GzipCodec",
                "set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat",
                "set hive.auto.convert.sortmerge.join.noconditionaltask=true",
                "set FEED_DATE=20160916",
                "set hive.optimize.bucketmapjoin=true",
                "set hive.exec.compress.intermediate=true",
                "set hive.enforce.bucketmapjoin=true",
                "set mapred.output.compress=true",
                "set mapreduce.map.output.compress=true",
                "set hive.auto.convert.sortmerge.join=true",
                "set hive.auto.convert.join=false",
                "set mapreduce.reduce.speculative=false",
                "set PD_KEY=vijay-test-mail@XXXcommerce.pagerduty.com",
                "set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec"

It starts failing with the above-mentioned exception.

Questions -

  1. Which of the Hive configuration properties that I set is causing the problem (I only upgraded the versions of Hive and Hadoop)?

Solution

  • As an interim solution, try disabling the sort-merge join property.

    Since you have set the sort-merge join property to true, Hive will by default use 2047 MB for io.sort.mb, which can lead to the ArrayIndexOutOfBoundsException. So when you enable the sort-merge join property, it is advisable to also set io.sort.mb to a value appropriate for the data size your query processes.
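
    For example, a minimal sketch of that interim workaround (assuming the failure really is triggered by the sort-merge-join settings enabled above; the buffer value is only illustrative) would be to override those properties before running the INSERT:

        -- Interim workaround: turn off the sort-merge-join conversion enabled earlier
        set hive.auto.convert.sortmerge.join=false;
        set hive.auto.convert.sortmerge.join.noconditionaltask=false;
        set hive.optimize.bucketmapjoin.sortedmerge=false;

        -- Or keep sort-merge join on and size the sort buffer explicitly.
        -- 512 MB is only an example; tune it to your data. In newer Hadoop releases
        -- the property io.sort.mb is named mapreduce.task.io.sort.mb.
        set io.sort.mb=512;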

    To see how much data the query processes, you can run EXPLAIN on it; the plan shows the data volume considered in each subquery and stage.
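
    For instance, here is a sketch of an EXPLAIN over the join from the question (table and column names taken from the query above). Where statistics are available, the plan output shows each stage with its estimated row counts and data sizes:

        EXPLAIN
        SELECT coalesce(t.sku, c.sku) AS sku
        FROM traffic_feed t
        FULL OUTER JOIN commerce_feed c
          ON t.sku = c.sku AND t.feed_date = c.feed_date;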

    Hope this helps.