Proper way to update an arrow dataset in R

I would like to know if there is a good practice for updating an arrow dataset. Imagine I have data that I first write as follows:


library(arrow)
library(dplyr)

td <- tempdir()

head(mtcars)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Create an arrow dataset with a partitioning based on cyl:

write_dataset(mtcars, td, partitioning = "cyl")

We can see that there are 3 folders, one for each value of cyl:

list.files(td)
#> [1] "cyl=4" "cyl=6" "cyl=8"

Now, let's open the dataset, filter it to keep only cyl == 6, and re-write it to the same folder:

open_dataset(td) |>
  filter(cyl == 6) |>
  write_dataset(td, partitioning = "cyl")

There are still 3 sub-folders:

list.files(td)
#> [1] "cyl=4" "cyl=6" "cyl=8"

All the original data is still there because re-writing cyl == 6 did not remove cyl == 4 and cyl == 8:

open_dataset(td) |>
  distinct(cyl) |>
  collect()
#> # A tibble: 3 × 1
#>     cyl
#>   <int>
#> 1     6
#> 2     4
#> 3     8

My question is: how should one update an existing dataset?

Created on 2022-08-31 with reprex v2.0.2


  • It depends on what you mean by "update". Under one reading of "update", that is exactly what you did: you overwrote the cyl=6 files and didn't touch any others.

    write_dataset() has an existing_data_behavior argument that governs this. The default, "overwrite", replaces only the individual files whose names collide with the newly written ones and leaves everything else in place. You can instead set it to "error" to make the write fail if data already exists at the destination, or to "delete_matching", which in this example would delete cyl=6/* and write new files for that partition.

    See the write_dataset() documentation (?write_dataset) for details.
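
As a rough illustration of the "delete_matching" behavior described above, here is a minimal sketch. It assumes a fresh directory (the `mtcars_ds` prefix is only an illustrative name) and a made-up change to the data (adding 1 to mpg). It also collects the filtered rows into memory before writing, which is a cautious choice on my part so that the write does not read from files it is about to delete; it is not something the documentation requires.

```r
library(arrow)
library(dplyr)

# Hypothetical fresh directory so nothing else ends up next to the dataset
td <- tempfile("mtcars_ds")
write_dataset(mtcars, td, partitioning = "cyl")

# Build the replacement rows for the cyl == 6 partition in memory.
# The mpg + 1 change is only an illustrative "update".
updated <- open_dataset(td) |>
  filter(cyl == 6) |>
  collect() |>
  mutate(mpg = mpg + 1)

# Only partitions present in `updated` (here cyl=6) are cleared and rewritten;
# cyl=4 and cyl=8 are left untouched.
write_dataset(
  updated,
  td,
  partitioning = "cyl",
  existing_data_behavior = "delete_matching"
)

list.files(td)
```

If the goal is instead to drop the other partitions entirely, this argument will not do it on its own; writing the filtered data to a new directory is one straightforward alternative.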