
How to Open "GZ FILE" using sparklyr in R?


I'd like to open a gz file using the sparklyr package, since I'm using Spark from R. I know that I can use read.delim2(gzfile("filename.csv.gz"), sep = ",", header = FALSE) to open a gz file in plain R, and that I can use spark_read_csv to open a csv file, but neither works when I try to open the gz file in Spark. Please help!


Solution

  • Default Spark readers can load gzipped data transparently, without any additional configuration, as long as the file has a proper extension indicating the compression used.

    So if you have a gzipped file like this (note that this setup works only in local mode; in distributed mode the file must be on shared storage):

    valid_path <- tempfile(fileext = ".csv.gz")
    valid_conn <- gzfile(valid_path, "w")
    readr::write_csv(iris, valid_conn)
    close(valid_conn)
    

    spark_read_csv will work just fine:

    # sc is an open Spark connection, e.g. sc <- spark_connect(master = "local")
    spark_read_csv(sc, "valid", valid_path)
    
    # Source: spark<valid> [?? x 5]
       Sepal_Length Sepal_Width Petal_Length Petal_Width Species
              <dbl>       <dbl>        <dbl>       <dbl> <chr>  
     1          5.1         3.5          1.4         0.2 setosa 
     2          4.9         3            1.4         0.2 setosa 
     3          4.7         3.2          1.3         0.2 setosa 
     4          4.6         3.1          1.5         0.2 setosa 
     5          5           3.6          1.4         0.2 setosa 
     6          5.4         3.9          1.7         0.4 setosa 
     7          4.6         3.4          1.4         0.3 setosa 
     8          5           3.4          1.5         0.2 setosa 
     9          4.4         2.9          1.4         0.2 setosa 
    10          4.9         3.1          1.5         0.1 setosa 
    

    However, the same gzipped content saved without the compression extension, like this:

    # gzip-compressed content, but the name ends in plain ".csv"
    invalid_path <- tempfile(fileext = ".csv")
    invalid_conn <- gzfile(invalid_path, "w")
    readr::write_csv(iris, invalid_conn)
    close(invalid_conn)
    

    won't work, because Spark reads the data as-is, without decompressing it:

    spark_read_csv(sc, "invalid", invalid_path)
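
    If you are not sure whether a file with a misleading name is actually gzipped, one option is to check its first two bytes and give it a matching extension. This is only a minimal sketch in base R: is_gzip is a hypothetical helper, not part of sparklyr, and write.csv stands in for readr::write_csv to avoid extra dependencies.

    ```r
    # Hypothetical helper (not part of sparklyr): TRUE if the file starts
    # with the gzip magic bytes 0x1f 0x8b.
    is_gzip <- function(path) {
      magic <- readBin(path, what = "raw", n = 2L)
      length(magic) == 2L && identical(magic, as.raw(c(0x1f, 0x8b)))
    }

    # Demo file: gzip-compressed content saved under a plain ".csv" name.
    mislabeled_path <- tempfile(fileext = ".csv")
    conn <- gzfile(mislabeled_path, "w")
    write.csv(iris, conn, row.names = FALSE)
    close(conn)

    # Copy the file to a name ending in ".gz" so the extension matches the content.
    fixed_path <- mislabeled_path
    if (is_gzip(mislabeled_path)) {
      fixed_path <- paste0(mislabeled_path, ".gz")
      file.copy(mislabeled_path, fixed_path)
    }

    # With an open connection sc, the copy can then be read as usual:
    # spark_read_csv(sc, "fixed", fixed_path)
    ```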
    

    Also keep in mind that gzip is not splittable, and as such is a poor choice for distributed applications. If the file is large, it typically makes sense to unpack it with standard system tools before you proceed with Spark.
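
    If you would rather stay in R than shell out to a system tool such as gunzip, the unpacking step could look roughly like this. It is a sketch in base R, not the system-tool approach described above: gunzip_to is a hypothetical helper, and the chunk size is an arbitrary assumption. In distributed mode the unpacked file would still need to live on shared storage.

    ```r
    # Hypothetical helper: stream a gzip file into an uncompressed copy in
    # chunks, so the whole file never has to fit in memory at once.
    gunzip_to <- function(src, dest, chunk_bytes = 1024L * 1024L) {
      input <- gzfile(src, "rb")
      on.exit(close(input), add = TRUE)
      output <- file(dest, "wb")
      on.exit(close(output), add = TRUE)
      repeat {
        chunk <- readBin(input, what = "raw", n = chunk_bytes)
        if (length(chunk) == 0L) break  # end of compressed stream
        writeBin(chunk, output)
      }
      invisible(dest)
    }

    # Demo with a small gzipped CSV.
    gz_path <- tempfile(fileext = ".csv.gz")
    conn <- gzfile(gz_path, "w")
    write.csv(iris, conn, row.names = FALSE)
    close(conn)

    csv_path <- gunzip_to(gz_path, tempfile(fileext = ".csv"))

    # With an open connection sc, read the uncompressed copy:
    # spark_read_csv(sc, "unpacked", csv_path)
    ```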