sql-server hadoop sqoop

How to reduce log size for sqoop export


Is there a way to control the size of the logs created by sqoop export? I am trying to export a series of Parquet files from a Hadoop cluster to Microsoft SQL Server, and after a certain point in the map phase, progress becomes very slow or freezes. My current theory, from looking at the Hadoop ResourceManager, is that the logs from the sqoop job grow to a size that causes the process to freeze.

I am new to Hadoop, so any advice would be appreciated. Thanks.

Update

Looking at the syslog output for one of the frozen map tasks of the sqoop application in the ResourceManager web interface, the log looks like this:

2017-11-14 16:26:52,243 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 8758 8840 ]
2017-11-14 16:26:52,243 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: reading next wrapped RPC packet
2017-11-14 16:26:52,243 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 sending #280
2017-11-14 16:26:52,243 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.security.SaslRpcClient: wrapping token of length:751
2017-11-14 16:26:52,246 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: unwrapping token of length:62
2017-11-14 16:26:52,246 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 got value #280
2017-11-14 16:26:52,246 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 3
2017-11-14 16:26:55,252 DEBUG [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: [ 8758 8840 ]
2017-11-14 16:26:55,252 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: reading next wrapped RPC packet
2017-11-14 16:26:55,252 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 sending #281
2017-11-14 16:26:55,252 DEBUG [IPC Parameter Sending Thread #0] org.apache.hadoop.security.SaslRpcClient: wrapping token of length:751
2017-11-14 16:26:55,254 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.security.SaslRpcClient: unwrapping token of length:62
2017-11-14 16:26:55,255 DEBUG [IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490] org.apache.hadoop.ipc.Client: IPC Client (<ipc_client_num>) connection to /<myipaddress>:23716 from job_1502069985038_3490 got value #281
2017-11-14 16:26:55,255 DEBUG [communication thread] org.apache.hadoop.ipc.RPC: Call: statusUpdate 3

Furthermore, after letting the process run throughout the day, it seems that the sqoop job does eventually finish, but it takes a very long time (~4 hours for ~500 MB of .tsv data).
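
The excerpt above consists entirely of DEBUG lines, so one quick check is to see how much of a task log is made up of each log level. The following is a minimal sketch, assuming the task syslog has been saved to a local file; the function name `count_log_levels` and the file name `syslog` are hypothetical, and the level is taken from the third whitespace-separated field, as in the lines shown above.

```shell
# Count log lines per level in a Hadoop-style task log.
# Assumes lines look like: "<date> <time> <LEVEL> [thread] class: message",
# so the level is the third whitespace-separated field.
count_log_levels() {
    awk '$3 ~ /^(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)$/ { count[$3]++ }
         END { for (lvl in count) print lvl, count[lvl] }' "$1" | sort
}

# Example usage (file name is hypothetical):
#   count_log_levels syslog
```

If nearly every line is DEBUG, lowering the log level should shrink the logs considerably; whether that also fixes the slowdown is a separate question.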


Solution

  • In response to the title of the question: the log output of the sqoop command can be controlled in two ways. One is to edit the log4j.properties file in the $HADOOP_HOME/etc/hadoop directory, since sqoop apparently inherits its log properties from it (though, from what I can tell, this may not be the case with sqoop2). The other is to pass generic arguments with the -D prefix in the sqoop call, e.g.:

    # Generic -D arguments must come right after the tool name (export).
    # ${tablename^^} uppercases the table name (requires bash 4+).
    sqoop export \
        -Dyarn.app.mapreduce.am.log.level=WARN \
        -Dmapreduce.map.log.level=WARN \
        -Dmapreduce.reduce.log.level=WARN \
        --connect "$connectionstring" \
        --driver com.microsoft.sqlserver.jdbc.SQLServerDriver \
        --table "$tablename" \
        --export-dir "/tmp/${tablename^^}_export" \
        --num-mappers 24 \
        --direct \
        --batch \
        --input-fields-terminated-by '\t'
    

    However, my initial theory from the body of the post (that the logs from the sqoop job were growing to a size that caused the process to freeze) did not hold up. With these settings, the log sizes for the map tasks fell to 0 bytes in the ResourceManager UI, but the job still ran well up to a certain percentage and then dropped to a very slow speed.
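
    For the log4j.properties route mentioned above, here is a rough sketch of what the edit could look like. It assumes the file contains a line of the form `hadoop.root.logger=INFO,console`, as stock Hadoop configurations usually do; check your own file first. The helper name `set_root_log_level` is hypothetical, and the function keeps a `.bak` copy of the original file.

```shell
# Hypothetical helper: change the level in the "hadoop.root.logger=LEVEL,appender"
# line of a log4j.properties file, leaving the appender list untouched.
# Assumes such a line exists; a backup is written to "<file>.bak".
set_root_log_level() {
    local file="$1" level="$2"
    sed -i.bak -E "s/^(hadoop\.root\.logger=)[A-Za-z]+/\1${level}/" "$file"
}

# Example usage (path taken from the answer above):
#   set_root_log_level "$HADOOP_HOME/etc/hadoop/log4j.properties" WARN
```

    Trying this on a copy of the file first is safer than editing the cluster configuration in place.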