Is there a way to change data types of columns when reading parquet files?
I'm using the spark_read_parquet function from sparklyr, but it doesn't have the columns option (from spark_read_csv) to change them.
For csv files, I would do something like:
data_tbl <- spark_read_csv(sc, "data", path, infer_schema = FALSE, columns = list_with_data_types)
How could I do something similar with parquet files?
Specifying data types only makes sense when reading a data format that carries no built-in metadata on variable types. This is the case for csv or fwf files, which, at most, contain variable names in the first row; hence the read functions for those formats offer that option.
This sort of functionality does not make sense for data formats that store variable types in their metadata, such as Parquet (or .rds and .RData in R).
So in this case you should:
a) read the Parquet file into Spark
b) make the necessary data transformations
c) save the transformed data to a Parquet file, overwriting the previous one
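For example, a minimal sketch with sparklyr and dplyr (the column names id and value and the file paths are placeholders for your own data):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# a) read the Parquet file into Spark
data_tbl <- spark_read_parquet(sc, name = "data", path = "path/to/data.parquet")

# b) cast columns to the desired types (dplyr translates as.integer /
#    as.numeric into Spark SQL CASTs)
data_tbl <- data_tbl %>%
  mutate(
    id    = as.integer(id),
    value = as.numeric(value)
  )

# c) write the result back to Parquet; writing to a new path avoids
#    Spark's restriction on overwriting a dataset it is still reading from
spark_write_parquet(data_tbl, path = "path/to/data_recast.parquet", mode = "overwrite")

If you do need to replace the original file, write to a temporary path first and then swap the files, since Spark reads lazily and cannot overwrite a source it has not finished reading.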