Tags: bash, hadoop, sqoop, impala, parquet

sqoop create impala parquet table


I'm relatively new to the process of sqooping, so pardon any ignorance. I have been trying to sqoop a table from a data source as a parquet file and create an Impala table (also as parquet) into which I will insert the sqooped data. The code runs without an issue, but when I try to select a couple of rows for testing I get the following error:

.../EWT_CALL_PROF_DIM_SQOOP/ec2fe2b0-c9fa-4ef9-91f8-46cf0e12e272.parquet' has an incompatible Parquet schema for column 'dru_id.test_ewt_call_prof_dim_parquet.call_prof_sk_id'. Column type: INT, Parquet schema: optional byte_array CALL_PROF_SK_ID [i:0 d:1 r:0]

I was mirroring the process I found in a Cloudera guide here: https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_create_table.html, mainly the "Internal and External Tables" section. I've been trying to avoid having to infer the schema from a particular parquet file, since this whole thing will be kicked off every month by a bash script (and I also can't think of a way to point it at just one file if I use more than one mapper).
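
For context, the schema-inference approach I'm trying to avoid would look something like this (the parquet file name is a placeholder; sqoop generates its own file names):

    create external table test_EWT_CALL_PROF_DIM_parquet
    like parquet '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP/[some-file].parquet'
    stored as parquet
    location '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP';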

Here's the code I used. I feel like I'm either missing something small and stupid, or I've screwed up everything major without realizing it. Any and all help is appreciated. Thanks!

    sqoop import -Doraoop.import.hint=" " \
    --options-file /home/kemri/pass.txt \
    --verbose \
    --connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
    --username [userid] \
    --num-mappers 1 \
    --target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
    --delete-target-dir \
    --table DMPROD.EWT_CALL_PROF_DIM \
    --direct \
    --null-string '\\N' \
    --null-non-string '\\N' \
    --as-parquetfile 


impala-shell -k -i hrtimpslb.[employer].com


create external table test_EWT_CALL_PROF_DIM_parquet(
CALL_PROF_SK_ID INT,
SRC_SKL_CD_ID STRING,
SPLIT_NM STRING,
SPLIT_DESC STRING,
CLM_SYS_CD STRING,
CLM_SYS_NM STRING,
LOB_CD STRING,
LOB_NM STRING,
CAT_IND STRING,
CALL_TY_CD STRING,
CALL_TY_NM STRING,
CALL_DIR_CD STRING,
CALL_DIR_NM STRING,
LANG_CD STRING,
LANG_NM STRING,
K71_ATOMIC_TS TIMESTAMP)
stored as parquet location '/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP';

Solution

  • As per the request in the comments, I'm providing an example of how you could achieve the same thing using one sqoop import with --hive-import. For obvious reasons I haven't tested it against your specific requirements, so it may need some more tuning, which is often the case with these sqoop commands. In my experience, importing as parquet forces you to use the --query option, since it doesn't allow you to pass schema.table to --table.

    sqoop import -Doraoop.import.hint=" " \
    --verbose \
    --connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
    --username [userid] \
    -m 1 \
    --password [ifNecessary] \
    --hive-import \
    --query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
    --hive-database [database you want to use] \
    --hive-table test_EWT_CALL_PROF_DIM_parquet \
    --target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
    --null-string '\\N' \
    --null-non-string '\\N' \
    --as-parquetfile
    

    Basically, what you need for --hive-import is --hive-database, --hive-table and --query. If you don't want all your columns to appear in Hive as strings, you must also include:

    --map-column-hive [column_name1=Timestamp,column_name2=Int,...]
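
    For example, with the table definition from the question this would be something like the following (untested; types taken from your DDL):

    --map-column-hive CALL_PROF_SK_ID=int,K71_ATOMIC_TS=timestamp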
    

    You might need a similar --map-column-java option as well, but I'm never sure when that is required. You will also need a --split-by if you want multiple mappers.
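
    For example, a sketch splitting on the primary key (pick a column with a reasonably even distribution):

    --split-by CALL_PROF_SK_ID \
    -m 4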

    As discussed in the comments, you will need to run invalidate metadata db.table to make sure Impala sees these changes. You can issue both commands from the command line, or use a single bash script in which you issue the impala command via impala-shell -q [query].
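
    A minimal sketch of such a script (untested; [database] and the other bracketed values are placeholders):

    #!/bin/bash
    set -euo pipefail

    # 1. Import from Oracle into a parquet-backed Hive table (same command as above).
    sqoop import -Doraoop.import.hint=" " \
    --connect jdbc:oracle:thin:@ldap://oid:389/cn=OracleContext,dc=[employer],dc=com/EWSOP000 \
    --username [userid] \
    -m 1 \
    --hive-import \
    --query 'SELECT * FROM DMPROD.EWT_CALL_PROF_DIM WHERE $CONDITIONS' \
    --hive-database [database] \
    --hive-table test_EWT_CALL_PROF_DIM_parquet \
    --target-dir hdfs://nameservice1/data/res/warehouse/finance/[dru_userid]/EWT_CALL_PROF_DIM_SQOOP \
    --null-string '\\N' \
    --null-non-string '\\N' \
    --as-parquetfile

    # 2. Make Impala pick up the new table from the Hive metastore.
    impala-shell -k -i hrtimpslb.[employer].com \
    -q 'invalidate metadata [database].test_EWT_CALL_PROF_DIM_parquet'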