I'm running an experiment on Hadoop Hive, comparing the same Hive job on two different hardware setups, both hosted on AWS EMR. Here is the hive.sql script I run:
DROP DATABASE IF EXISTS labtest;
CREATE DATABASE labtest;
USE labtest;
CREATE TABLE IF NOT EXISTS laborder (Insertts TIMESTAMP, ordernr STRING, Patientnr STRING, visitnr STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
) ;
LOAD DATA INPATH '${INPUT}/laborder.csv' OVERWRITE INTO TABLE laborder;
CREATE TABLE IF NOT EXISTS labanalyse (resultaat STRING, deleted BOOLEAN, analysecodePk INT, Inserttimestamp TIMESTAMP,
specimennr STRING, uitvoeringsts TIMESTAMP
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
) ;
LOAD DATA INPATH '${INPUT}/labresult.csv' OVERWRITE INTO TABLE labanalyse;
CREATE TABLE IF NOT EXISTS labspecimen (specimennr STRING, ordernr STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
) ;
LOAD DATA INPATH '${INPUT}/labspecimen.csv' OVERWRITE INTO TABLE labspecimen;
CREATE TABLE IF NOT EXISTS labanalysecode (pk INT, analysecode STRING, analysecodeversion STRING, `view` INT, referencevalue STRING, unit STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\""
) ;
LOAD DATA INPATH '${INPUT}/labordercodes.csv' OVERWRITE INTO TABLE labanalysecode;
SELECT * FROM laborder
INNER JOIN labspecimen ON labspecimen.ordernr = laborder.ordernr
INNER JOIN labanalyse ON labanalyse.specimennr = labspecimen.specimennr
INNER JOIN labanalysecode ON labanalysecode.pk = labanalyse.analysecodepk;
I run this script with the following dataset size:
laborder = 40 MB
labresult = 150 MB
labspecimen = 46 MB
This job runs in about 40 seconds. My expectation was that running it on a setup with 1 master node and 2 core nodes would be slower than on a setup with 1 master and 4 core nodes, but the difference is minimal. Any clues as to what I'm doing wrong? What can I do to make better use of multiple machines?
It appears that each of your inputs is a single file. This is an inefficient way to operate Hive, which is designed to process data in parallel.
Best practice is to point it at a directory containing many files: at least one file per node, or better, one file per task slot that can run across the cluster.
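For example, a large CSV can be pre-split before it is loaded into S3/HDFS. This is a sketch with made-up filenames; `split -n l/8` (GNU coreutils) divides the file into 8 pieces without breaking lines:

```shell
# Generate a small sample CSV standing in for labresult.csv (assumed data).
seq 1 100000 | awk '{print $1 ",SPEC" $1}' > labresult.csv

# Split it into 8 line-aligned pieces, one per task slot we want to keep busy.
mkdir -p labresult_dir
split -n l/8 labresult.csv labresult_dir/part_

# Point LOAD DATA INPATH at labresult_dir instead of the single file,
# and Hive can schedule one mapper per piece.
ls labresult_dir | wc -l
```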
Also, your input data is very small. Hadoop and Hive are designed to work across GBs or even TBs of data. Something as small as 40 MB is not a good use case for Hive, where the overhead of starting a job can take longer than actually processing the data.
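That said, plain-text CSV is splittable even as a single file, so on a small job you can sometimes coax out more parallelism by shrinking the input split size. A sketch — the property names are standard Hadoop/Hive settings, but the values here are only illustrative:

```sql
-- Force smaller input splits so more map tasks run in parallel.
SET mapreduce.input.fileinputformat.split.maxsize=16777216;  -- 16 MB per split
-- Lower the data-per-reducer threshold so the joins fan out to more reducers.
SET hive.exec.reducers.bytes.per.reducer=67108864;           -- 64 MB per reducer
```

With ~150 MB of input, 16 MB splits would give roughly 9 mappers instead of 1-2, though the per-task startup overhead may still dominate at this scale.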
If your data is that small, use a traditional database or even an Excel spreadsheet!