How to open a .gz file using sparklyr in R?

I'd like to open a gz file using the sparklyr package, since I'm using Spark on R. I know that I can use read.delim2(gzfile("filename.csv.gz"), sep = ",", header = FALSE) to open a gz file, and that I can use spark_read_csv to open a csv file, but neither works when I try to open the gz file in Spark. Please help!

Solution

Default Spark readers can load gzipped data transparently, without any additional configuration, as long as the file has a proper extension indicating the compression used.

So if you have a gzipped file like this (note that this setup works only in local mode; in distributed mode you need shared storage):

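library(sparklyr)

# Assumed local connection; the temporary file below exists only on this machine
sc <- spark_connect(master = "local")

valid_path <- tempfile(fileext = ".csv.gz")
valid_conn <- gzfile(valid_path, "w")
readr::write_csv(iris, valid_conn)
close(valid_conn)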

spark_read_csv will work just fine:

spark_read_csv(sc, "valid", valid_path)

# Source: spark<valid> [?? x 5]
   Sepal_Length Sepal_Width Petal_Length Petal_Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa

However, this:

# Gzip-compressed content, but the .csv extension gives Spark no hint of that
invalid_path <- tempfile(fileext = ".csv")
invalid_conn <- gzfile(invalid_path, "w")
readr::write_csv(iris, invalid_conn)
close(invalid_conn)

won't, as Spark will read the data as-is, without decompressing it:

spark_read_csv(sc, "invalid", invalid_path)
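
If you are stuck with a compressed file that lacks the extension, one possible workaround is to check for the gzip magic bytes and copy the file to a name ending in .gz before handing it to Spark. This is only a minimal sketch in base R: is_gzipped and with_gz_extension are hypothetical helper names, not part of sparklyr, and the final Spark call is commented out because it assumes a live connection sc.

```r
# Hypothetical helper (not part of sparklyr): TRUE if the file starts with
# the gzip magic bytes 0x1f 0x8b.
is_gzipped <- function(path) {
  magic <- readBin(path, what = "raw", n = 2L)
  identical(magic, as.raw(c(0x1f, 0x8b)))
}

# Hypothetical helper: if a gzipped file lacks the ".gz" suffix, copy it to
# a path that has one and return that path; otherwise return the input path.
with_gz_extension <- function(path) {
  if (grepl("\\.gz$", path) || !is_gzipped(path)) {
    return(path)
  }
  fixed_path <- paste0(path, ".gz")
  file.copy(path, fixed_path, overwrite = TRUE)
  fixed_path
}

# Demo: gzip-compressed content saved under a plain ".csv" name
demo_path <- tempfile(fileext = ".csv")
demo_conn <- gzfile(demo_path, "w")
writeLines(c("x,y", "1,2"), demo_conn)
close(demo_conn)

fixed_path <- with_gz_extension(demo_path)

# With a live connection `sc`, the copy can then be read as usual:
# spark_read_csv(sc, "fixed", fixed_path)
```

Copying rather than renaming leaves the original file untouched, at the cost of extra disk space; file.rename would be the cheaper choice if you own the file.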

Also, please keep in mind that gzip is not splittable, and as such is a poor choice for distributed applications. So if the file is large, it typically makes sense to unpack it using standard system tools before you proceed with Spark.
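
A system tool such as gunzip is the usual way to unpack. If you would rather stay inside R, the same step could be sketched with base R connections, as below. gunzip_to is a hypothetical helper name and the chunk size is an arbitrary choice; the Spark call is commented out because it assumes a live connection sc.

```r
# Hypothetical helper: stream-decompress `src` into `dest` in fixed-size
# chunks, so the whole file never has to fit in memory.
gunzip_to <- function(src, dest, chunk_size = 1024L * 1024L) {
  input <- gzfile(src, open = "rb")
  on.exit(close(input), add = TRUE)
  output <- file(dest, open = "wb")
  on.exit(close(output), add = TRUE)

  repeat {
    chunk <- readBin(input, what = "raw", n = chunk_size)
    if (length(chunk) == 0L) break
    writeBin(chunk, output)
  }
  invisible(dest)
}

# Demo with a small gzipped CSV
gz_path <- tempfile(fileext = ".csv.gz")
gz_conn <- gzfile(gz_path, "w")
writeLines(c("x,y", "1,2", "3,4"), gz_conn)
close(gz_conn)

csv_path <- gunzip_to(gz_path, tempfile(fileext = ".csv"))

# With a live connection `sc`, the uncompressed file can then be read as usual:
# spark_read_csv(sc, "unpacked", csv_path)
```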