Tags: performance, file, apache-spark, namenode

Spark: input and output directory sizes differ (for the same data)


In order to reduce the number of blocks allocated by the NameNode, I'm trying to concatenate small files into 128MB files. The small files are in gzip format, and the 128MB files must be in gzip format too.

To accomplish this, I compute the total size of all the small files, then divide that total (in MB) by 128 to get the number of output files I need.

Then I perform an `rdd.repartition(nbFiles).saveAsTextFile(PATH, classOf[GzipCodec])`.
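For context, here is a sketch of that approach. It assumes an existing `SparkContext` named `sc`; the paths and the HDFS size lookup are hypothetical illustrations, not part of the original question:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.compress.GzipCodec

// Hypothetical paths; adjust to your cluster layout.
val inputPath  = "/data/small-files"
val outputPath = "/data/merged-files"

// Sum the sizes of the small input files (bytes), then derive the
// target number of ~128MB output files, with a floor of 1.
val fs = FileSystem.get(sc.hadoopConfiguration)
val totalBytes = fs.listStatus(new Path(inputPath)).map(_.getLen).sum
val nbFiles = math.max(1, (totalBytes / (128L * 1024 * 1024)).toInt)

// Repartition into nbFiles partitions (this shuffles all records),
// then write one gzip-compressed part file per partition.
val rdd = sc.textFile(inputPath)
rdd.repartition(nbFiles).saveAsTextFile(outputPath, classOf[GzipCodec])
```

Note that `listStatus` reports the *compressed* sizes of the gzip inputs, so `nbFiles` here targets 128MB of compressed data per output file.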

The problem is that my output directory size is about 10% higher than my input directory size. I tested with both the default and the best compression level, and I always get a larger output.

I have no idea why the output directory ends up larger than the input directory, but I imagine it's linked to the fact that I'm repartitioning all the files in the input directory.

Can someone help me understand why I'm getting this result?

Thanks :)


Solution

  • The level of compression depends on the data distribution. When you call `rdd.repartition(nbFiles)` you randomly shuffle all the data, so if there was some structure in the input that reduced entropy and enabled better compression, it is lost.

    You can try another approach, such as `coalesce` (which avoids a shuffle) or sorting the data, to see if you get a better result.
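    Both alternatives can be sketched as follows. This is a hedged illustration, not the answerer's exact code: `rdd` and `nbFiles` are assumed from the earlier snippet, and the output paths are hypothetical:

    ```scala
    import org.apache.hadoop.io.compress.GzipCodec

    // coalesce with shuffle = false merges partitions without moving rows
    // randomly across the cluster, so the original record order within the
    // merged files is preserved and locally repeated patterns stay adjacent
    // for the gzip encoder.
    rdd.coalesce(nbFiles, shuffle = false)
       .saveAsTextFile("/data/merged-coalesce", classOf[GzipCodec])

    // Alternatively, sorting groups similar records together, which can
    // lower entropy within each partition and improve the compression ratio
    // (at the cost of a full shuffle for the sort).
    rdd.sortBy(identity, numPartitions = nbFiles)
       .saveAsTextFile("/data/merged-sorted", classOf[GzipCodec])
    ```

    Whether sorting helps depends entirely on the data: if similar lines end up adjacent, gzip's back-references get shorter and the ratio improves.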