hadoop, amazon-s3, hive, sqoop, parquet

Loading data from SQL Server to S3 as parquet - AWS EMR


We have our data in SQL Server at the moment and are trying to move it to our S3 bucket as Parquet files. The intention is to analyse this S3 data in AWS EMR (mainly Spark, Hive and Presto). We don't want to store our data in HDFS.

  1. What are the choices here? From what we know so far, we can use either Spark or Sqoop for this import. Although Sqoop is faster than Spark here thanks to its parallel database connections, writing Parquet files from Sqoop directly to S3 does not seem to be possible - Sqoop + S3 + Parquet results in a Wrong FS error. The workaround is to write to HDFS first and then copy to S3, but that seems inefficient. How about using Spark SQL to pull the data from SQL Server and write it to S3 as Parquet? (A rough sketch of what we have in mind is at the end of this question.)

  2. Once we have loaded the data as Parquet in the following layout:

    s3://mybucket/table_a/day_1/(parquet files 1 ... n).
    s3://mybucket/table_a/day_2/(parquet files 1 ... n).
    s3://mybucket/table_a/day_3/(parquet files 1 ... n).
    

How can we combine these into a single table and query it with Hive? I understand that we can create a Hive external table pointing to S3, but can one table point to multiple files?
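For reference, this is roughly what we have in mind for the Spark SQL route - only a sketch, assuming the Microsoft SQL Server JDBC driver is available on the cluster, that the source table has a numeric id column we can use to parallelise the read, and that it carries a day column we can partition the output on; the server, database, credentials and column names below are placeholders:

    from pyspark.sql import SparkSession

    # On EMR the S3 filesystem is already wired up, so s3:// paths work directly.
    spark = SparkSession.builder.appName("sqlserver-to-s3-parquet").getOrCreate()

    # Read the table from SQL Server over JDBC.
    # <server>, <db>, <user>, <password> and the column names are placeholders.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://<server>:1433;databaseName=<db>")
          .option("dbtable", "dbo.table_a")
          .option("user", "<user>")
          .option("password", "<password>")
          # Split the read into parallel JDBC connections on a numeric column,
          # similar to what Sqoop does with --split-by.
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "10000000")
          .option("numPartitions", "8")
          .load())

    # Write Parquet to S3 with one sub-directory per day value, so Hive can pick
    # the directories up as partitions (this produces day=... folders rather than
    # the plain day_1 layout shown above).
    (df.write.mode("append")
       .partitionBy("day")
       .parquet("s3://mybucket/table_a/"))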

Thanks.

EDIT: Adding the error trace, as requested.

org.apache.hive.service.cli.HiveSQLException: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
    at org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:380)
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:257)
    at org.apache.hive.service.cli.operation.SQLOperation.access$800(SQLOperation.java:91)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:348)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:362)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Solution

  • You can create the Hive external table as follows:

    create external table table_a (
     siteid                    string,
     nodeid                    string,
     aggregation_type          string
     )
     PARTITIONED BY (day string)
     STORED AS PARQUET
     LOCATION 's3://mybucket/table_a';
    

    Then you can run the following command to register the partition directories stored under each day's folder in the Hive metastore. Note that MSCK REPAIR TABLE only discovers directories that follow Hive's key=value naming convention (for example s3://mybucket/table_a/day=day_1/); if the folders keep a different layout, each partition has to be added manually with ALTER TABLE ... ADD PARTITION.

     MSCK REPAIR TABLE table_a;
    

    Now you can access your files through Hive queries. We have used this approach in our project and it works well. After the above command, you can run a query such as:

     select * from table_a where day='day_1';
    

    Hope this helps.

    -Ravi