
How to prevent Apache Pig from outputting empty files?


I have a Pig script that reads data from a directory on HDFS. The data is stored as Avro files. The directory structure looks like:

DIR--
   --Subdir1
   --Subdir2
   --Subdir3
   --Subdir4

In the pig script I am simply doing a load, filter and store. It looks like:

-- path, outputDirectory and the filter condition are placeholders
items = LOAD 'path' USING AvroStorage();
items = FILTER items BY <some property>;
STORE items INTO 'outputDirectory' USING AvroStorage();

The problem right now is that Pig writes many empty files to the output directory. Is there a way to avoid creating those files, or to remove them? Thanks!


Solution

  • For Pig 0.13 and later, you can set pig.output.lazy=true to avoid creating empty output files (see https://issues.apache.org/jira/browse/PIG-3299).
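
As a minimal sketch of how the setting could be applied to the script from the question: the property can be set at the top of the script with Pig's SET statement. The `$path` and `$outputDirectory` parameters and the `some_field` filter condition are hypothetical stand-ins for the placeholders in the question, and this assumes a Pig version where AvroStorage is available as a built-in.

```pig
-- Sketch only; assumes Pig 0.13 or later.
-- Ask Pig to create an output file only when a task actually writes a record.
SET pig.output.lazy true;

-- $path and $outputDirectory are hypothetical parameters, e.g. passed with
--   pig -param path=... -param outputDirectory=... script.pig
items = LOAD '$path' USING AvroStorage();

-- some_field is a hypothetical column standing in for "some property".
items = FILTER items BY some_field IS NOT NULL;

STORE items INTO '$outputDirectory' USING AvroStorage();
```

It should also be possible to pass the property on the command line instead, for example `pig -Dpig.output.lazy=true script.pig`, which keeps the script itself unchanged. Whether empty part files disappear completely may depend on the storage function in use, so it is worth checking the output directory after a test run.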