Tags: hive, hadoop-yarn, emr, parquet, tez

Parquet Warning Filling up Logs in Hive MapReduce on Amazon EMR


I am running a custom UDAF on a table stored as Parquet in Hive on Tez. Our Hive jobs run on YARN, all set up on Amazon EMR. However, because the Parquet data we have was generated with an older version of Parquet (1.5), I am getting a warning that fills up the YARN logs and causes the disk to run out of space before the job finishes.

This is the warning:

PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version

It also prints a stack trace. I have been trying to silence the warning to no avail; I have managed to turn off just about every other type of log, but not this warning. I have tried modifying just about every Log4j settings file using the AWS config as outlined here.

Things I have tried so far:

  1. I set the following in tez-site.xml (written here in JSON format because that is what the AWS configuration API requires; it is in proper XML format on the actual instance, of course). A sketch of how these snippets are wrapped in the configuration JSON follows this list.

    "tez.am.log.level": "OFF",
    "tez.task.log.level": "OFF",
    "tez.am.launch.cluster-default.cmd-opts": "-Dhadoop.metrics.log.level=OFF -Dtez.root.logger=OFF,CLA",
    "tez.task-specific.log.level": "OFF;org.apache.parquet=OFF"
    
  2. I have the following settings in mapred-site.xml. These effectively turned off all logging that appears in my YARN logs except for the warning in question.

      "mapreduce.map.log.level": "OFF",
      "mapreduce.reduce.log.level": "OFF",
      "yarn.app.mapreduce.am.log.level": "OFF"
    
  3. I have these settings in just about every other log4j.properties file I found in the list shown in the previous AWS link.

      "log4j.logger.org.apache.parquet.CorruptStatistics": "OFF",
      "log4j.logger.org.apache.parquet": "OFF",
      "log4j.rootLogger": "OFF, console"
    
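For reference, this is roughly how the tez-site and mapred-site snippets above are wrapped when passed to EMR as configuration JSON. It is a sketch built only from the values in steps 1 and 2; the Log4j overrides in step 3 would go into the separate classifications for the various log4j.properties files:

    [
      {
        "Classification": "tez-site",
        "Properties": {
          "tez.am.log.level": "OFF",
          "tez.task.log.level": "OFF",
          "tez.am.launch.cluster-default.cmd-opts": "-Dhadoop.metrics.log.level=OFF -Dtez.root.logger=OFF,CLA",
          "tez.task-specific.log.level": "OFF;org.apache.parquet=OFF"
        }
      },
      {
        "Classification": "mapred-site",
        "Properties": {
          "mapreduce.map.log.level": "OFF",
          "mapreduce.reduce.log.level": "OFF",
          "yarn.app.mapreduce.am.log.level": "OFF"
        }
      }
    ]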

Honestly, at this point I just want to find some way to turn off the logs and get the job running. I've read about similar issues, such as this link, where they were fixed by changing Log4j settings, but that was for Spark, and it just doesn't seem to work on Hive/Tez on Amazon EMR. Any help is appreciated.


Solution

  • OK, so I ended up fixing this by modifying the Java logging.properties file for EVERY single data node and the master node in EMR. In my case the file was located at /etc/alternatives/jre/lib/logging.properties

    I added a shell command to the bootstrap action file to automatically append the following two lines to the end of the properties file (a sketch of such a bootstrap script follows the lines below):

    org.apache.parquet.level=SEVERE

    org.apache.parquet.CorruptStatistics.level=SEVERE
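
    This is a minimal sketch of what that bootstrap step might look like, assuming the logging.properties path from above and adding a grep guard so the lines are not appended twice. The warning appears to be emitted through java.util.logging rather than Log4j (note the JUL-style "PM WARNING:" prefix), which would explain why none of the Log4j changes helped. Adjust the path for your own AMI.

        #!/bin/bash
        # Hypothetical EMR bootstrap action: raise the java.util.logging level
        # for org.apache.parquet so the CorruptStatistics warning is suppressed.
        # Path assumed from the answer above; verify it on your nodes.
        LOGGING_PROPS=/etc/alternatives/jre/lib/logging.properties

        # Append the overrides only if they are not already present.
        if ! grep -q '^org.apache.parquet.level' "$LOGGING_PROPS"; then
          {
            echo 'org.apache.parquet.level=SEVERE'
            echo 'org.apache.parquet.CorruptStatistics.level=SEVERE'
          } | sudo tee -a "$LOGGING_PROPS" > /dev/null
        fi

    Registered as a bootstrap action, a script like this runs on every node (master and core/task) as the cluster starts, which is exactly what is needed here.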

    Just wanted to post an update in case anyone else faces the same issue, as this is really not set up properly by Amazon and it took a lot of trial and error.