Tags: parquet, pyarrow, apache-arrow

What is actually meant when referring to parquet row-group size?


I am starting to work with the parquet file format. The official Apache site recommends large row groups of 512MB to 1GB (here). Several online sources (e.g. this one) suggest that the default row group size is 128MB.

I have a large number of parquet files which I will later process downstream with PySpark on AWS Glue. These files have very small row-groups. I cannot control the files I'm starting with, but I want to combine row-groups so as to have "more efficient" files prior to downstream processing (why? these files will be uploaded to S3 and processed with Spark; my understanding is that Spark reads one row-group at a time, so many small row-groups mean more IO operations, which is inefficient; if this assumption is invalid please educate me).

Let's consider just one of these files for this question. It's compressed (with snappy compression) and 85MB on disk. When I inspect it with the pqrs tool, it reports that the file has 55,733 records in 1,115 row groups, and each row group seems to be around 500 kB - specifically, something like this:

row group 7:
--------------------------------------------------------------------------------
total byte size: 424752
num of rows: 50

If I simply take (1,115 row-groups * 500 kB/row-group) that's around 500MB, whereas the file on disk is 85MB. Granted, some of the row-groups are smaller than 500kB, but I eyeballed around 100 of them (half at the top, half at the bottom) and they're in that general ballpark.
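Instead of eyeballing, the same tally can be done with pyarrow's metadata API. A rough sketch (the file name here is just a placeholder for my input file):

import pyarrow.parquet as pq

md = pq.ParquetFile("input.parquet").metadata
# sum the reported byte size across every row group in the file
total_uncompressed = sum(
    md.row_group(i).total_byte_size for i in range(md.num_row_groups)
)
print(md.num_row_groups, "row groups,", round(total_uncompressed / 1024**2), "MB total")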

Sub-question 1: is the difference (500MB calculated vs 85MB actual) because the row-group size reported by pqrs actually represents the uncompressed size, maybe what would be the in-memory size of the row-group (which presumably would be larger than the compressed serialized size on disk)? In other words I can't do a simplistic 1115 * 500 but have to apply some sort of compression factor?

Sub-question 2: when I see that the recommended batch size is 128MB, what exactly does that refer to? The uncompressed in-memory size? The serialized, compacted size on disk? Something else? How does it relate to what's reported by pqrs?

My (simplified) code to compact these row-groups is:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

def compact_parquet_in_batches(infile, outfile, batchsize):
    parquet_file = pq.ParquetFile(infile)
    ds.write_dataset(
        # stream batches of `batchsize` rows out of the source file
        parquet_file.iter_batches(batch_size=batchsize),
        outfile,
        schema=RSCHEMA,  # Arrow schema of the records, defined elsewhere
        format='parquet'
    )

Main question: What should batchsize be?

iter_batches takes batch_size as a number of records rather than a byte size. I could calculate it from the total record count and a desired number of batches, but I'm unclear on what I should actually be targeting.

I tried this:

  • required # batches = file size on disk in MB / 128 = 85/128 = 1 (rounded up)
  • batch size = # records / required # batches = 55,733 / 1 = 60,000 (rounded up to the next 10k; see the sketch below)
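In code, that back-of-the-envelope heuristic looks roughly like this (just a sketch of my reasoning; the 128MB target and the 10k rounding are arbitrary choices on my part):

import math
import os
import pyarrow.parquet as pq

def naive_batch_size(infile, target_mb=128):
    md = pq.ParquetFile(infile).metadata
    file_mb = os.path.getsize(infile) / 1024**2           # size on disk, in MB
    n_batches = max(1, math.ceil(file_mb / target_mb))    # required # of batches
    per_batch = math.ceil(md.num_rows / n_batches)        # records per batch
    return math.ceil(per_batch / 10_000) * 10_000         # round up to next 10k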

When I run my code with batch size of 60k:

  • I get two row groups (great, 1,115 is down to 2; but why not 1?)
  • the reported byte size of the first row group is around 250MB. So not only did it create twice the number of row groups I expected, each one is roughly double the size I expected rather than half:
row group 0:
--------------------------------------------------------------------------------
total byte size: 262055359
num of rows: 32768

I figure some of my assumptions - or my understanding of the parquet file format, the pqrs tool or the pyarrow library - are off. Can someone please demystify this for me?


Solution

  • TL;DR - 1 Mi rows

    Your understanding is roughly correct. Different tools have different recommendations, and they don't all measure row group size the same way: some (e.g. pyarrow) express it as a number of rows, while others (e.g. parquet-mr, the Java parquet implementation used by Spark) express it as a number of bytes.
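    For example, here is roughly how the two conventions show up in practice (a sketch with a stand-in table; the Spark line shows the parquet-mr setting passed as a write option):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(10))})  # stand-in data, just for illustration

    # pyarrow expresses row group size in rows:
    pq.write_table(table, "out.parquet", row_group_size=1024 * 1024)

    # parquet-mr (and therefore Spark) expresses it in bytes, e.g.:
    # df.write.option("parquet.block.size", 128 * 1024 * 1024).parquet("out_dir")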

    is the difference (500MB calculated vs 85MB actual) because the row-group size reported by pqrs actually represents the uncompressed size

    Yes. Tools are often not very clear on this. I find the parquet thrift definition to be a good source of ground truth when dealing with parquet metadata fields.

    struct RowGroup {
      /** Metadata for each column chunk in this row group.
       * This list must have the same order as the SchemaElement list in FileMetaData.
       **/
      1: required list<ColumnChunk> columns
    
      /** Total byte size of all the uncompressed column data in this row group **/
      /** Total byte size of all the uncompressed column data in this row group **/
      2: required i64 total_byte_size

      ...
    }

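    You can see the same distinction through pyarrow's metadata API: total_byte_size is the uncompressed size, while the on-disk size has to be summed from the column chunks. A sketch (the file name is a placeholder):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("input.parquet").metadata
    rg = md.row_group(7)
    # the on-disk size is the sum of the compressed column chunks in this row group
    compressed = sum(rg.column(i).total_compressed_size for i in range(rg.num_columns))
    print("uncompressed:", rg.total_byte_size, "| compressed on disk:", compressed)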

    when I see that the recommended batch size is 128MB, what exactly does that refer to? The uncompressed in-memory size? The serialized, compacted size on disk? Something else? How does it relate to what's reported by pqrs?

    Main question: What should batchsize be?

    The answer usually comes down to ensuring that you are making I/O requests that are ideal for your storage system. However, if your row groups are very small (e.g. 100, 1k, 10k rows) then it probably doesn't matter what your storage is (both because tiny row groups introduce extra compute overhead and because they skew the metadata-to-data ratio). These very small sizes are almost universally bad for performance.

    If you are using HDFS I believe the rules may be slightly different. I don't have much experience with HDFS. In all other cases you generally want row groups to be large enough that your I/O requests are big enough to satisfy your storage system.

    For example, when reading from a HDD, if you do a bunch of random 64 byte reads you will get worse performance than a bunch of sequential 64 byte reads. However, if you do a bunch of random 4MiB reads then you should get roughly the same performance as a bunch of sequential 4MiB reads. This will depend on the hard drive of course but I've found 4MiB to be a good number for HDD. On the other hand, if you are reading from S3, then their guidelines recommend 8-16MiB.

    Translating this to row group size is a bit tricky and will depend on your query habits. If you normally read all columns from a file then you will want your row group to be 8-16MiB. On the other hand, if you normally only read "some" columns from a file then you want each column chunk (a column's slice of one row group) to be 8-16MiB.

    Now things get tricky because we have to think about compression and encoding. For example, a boolean column is almost never 8MiB. You would need at least 64Mi rows and possibly quite a bit more due to compression. A float32 column is a little easier to reason with. You should get an 8MiB read with 2Mi rows and, in many cases, you don't get much compression from this.
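    To make that arithmetic concrete, a tiny back-of-the-envelope calculation (ignoring encoding and compression, which only push the required row counts higher):

    TARGET = 8 * 1024 * 1024  # aim for ~8 MiB per column chunk (uncompressed)

    for name, width_bits in [("bool", 1), ("float32", 32), ("float64", 64)]:
        rows = TARGET * 8 // width_bits
        print(f"{name}: ~{rows:,} rows to fill an 8 MiB column chunk")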

    All of the above is the theory. In practice I have done a fair amount of benchmarking, both on local disks and on S3, and I have found that 1Mi rows is generally a good size for a row group. There are probably cases where larger row groups are a good idea, and you can still get pretty good performance with smaller row groups; eventually you'll want to benchmark for your own use case. However, 1Mi is a nice round number that is easy to remember. If you need to express your row group size as a number of uncompressed bytes then it depends on how many columns you have. Again, as a rule of thumb, we can assume each value is 4 bytes, and so you can use the calculation:

    # of bytes = 1Mi * # of columns * 4
    

    In other words, if you have 10 columns, then aim for row groups of at least 40MiB (uncompressed).
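    Applied to your compaction code, a minimal sketch (assuming a reasonably recent pyarrow where write_dataset exposes min_rows_per_group / max_rows_per_group) would look something like this:

    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    def compact_parquet(infile, outdir, rows_per_group=1 << 20):  # ~1Mi rows
        parquet_file = pq.ParquetFile(infile)
        ds.write_dataset(
            parquet_file.iter_batches(batch_size=rows_per_group),
            outdir,
            schema=parquet_file.schema_arrow,   # reuse the source file's schema
            format="parquet",
            min_rows_per_group=rows_per_group,  # accumulate batches before flushing a row group
            max_rows_per_group=rows_per_group,
        )

    Here batch_size only controls how the source file is read; it is min_rows_per_group / max_rows_per_group that determine the row groups actually written, so you can think entirely in rows and skip the byte conversion.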

    What if I make row groups too large?

    Given the above, it may seem simple to just make the row groups massive, since that ensures ideal I/O requests. In a perfect world, where all parquet readers are created equal, I would say this is correct (one row group per file would be ideal).

    However, many parquet readers will use row groups either as:

    • The unit of parallelism - In this case you won't get enough parallelism if you only have one row group in your file.
    • The minimum size of a read - In this case the RAM consumption of your reader will be very high, even if you are doing streaming processing.

    For these reasons you will typically want to avoid overly large row group sizes.

    What about pushdown filtering / statistics / etc.?

    This is typically another reason to keep row groups small. Row group statistics are the easiest pushdown filtering tool to use, and some readers rely exclusively on them. That means the pushdown can only filter out entire row groups, so smaller row groups give you a better chance of eliminating I/O entirely.

    Fortunately, parquet reading tools have been slowly moving towards using page-level statistics (or page-level bloom filters) to do this filtering. Pages are quite small (~1MiB) and can offer very good resolution for filtering (though in some rare cases it is too fine a resolution, as it requires more metadata processing). If your parquet reader is able to utilize page-level statistics for pushdown then the row group size should have no effect on pushdown.
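    If you also control the writer, newer pyarrow releases can emit that page-level metadata for you; a sketch (assuming a pyarrow version that supports the write_page_index option):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(1000))})  # stand-in data for illustration
    # ask the writer to store per-page column statistics (the "page index")
    pq.write_table(table, "indexed.parquet", write_page_index=True)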

    Any kind of skipping or loading of single rows will be unrelated to row group sizes. Parquet readers should be capable of applying skips at page-level resolution.

    Disclaimer: I work on arrow-c++/pyarrow. The pyarrow dataset reader's performance is very dependent on row group size (I'm slowly trying to fix this), for some of the reasons I describe above.