Search code examples
rdatasetapache-arrow

What if groups has fewer rows than min_rows_per_group in arrow parquet dataset


I need to subset an arrow dataset according to some columns. I am using the write_dataset function in the R arrow package with the partitioning set on the columns I want to use to subset the dataset.

If I set the min_rows_per_group, let's say to 10, do all the groups that have less than ten raws get discarded?

(Filtering the groups with less than ten entries is something I'd like to do).

Thanks!


Solution

  • write_dataset should not drop rows regardless of configuration.

    Once the end of the input is reached then any queued writes (due to min_rows_per_group or min_rows_per_file) which were waiting for more data will be launched with whatever they have.

    If your goal is to filter groups with less than 10 entries then you can probably accomplish this with dplyr. Something like this should give reasonable performance:

    > arrow::write_parquet(mtcars, '/tmp/mtcars.parquet')
    > ds = arrow::open_dataset('/tmp/mtcars.parquet')
    > counts = ds %>% count(cyl) %>% filter(n > 10)
    > ds %>% inner_join(counts, by="cyl") %>% collect()
        mpg cyl  disp  hp drat    wt  qsec vs am gear carb  n
    1  22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1 11
    2  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2 14
    3  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4 14
    4  24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2 11
    5  22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2 11
    6  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3 14
    7  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3 14
    8  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3 14
    9  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4 14
    10 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4 14
    11 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4 14
    12 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1 11
    13 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2 11
    14 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 11
    15 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1 11
    16 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2 14
    17 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2 14
    18 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 14
    19 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2 14
    20 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1 11
    21 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2 11
    22 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2 11
    23 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4 14
    24 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8 14
    25 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2 11