Tags: r, apache-spark, pyspark, databricks, rdata

How to convert .rdata files to Parquet in Azure Data Lake using Databricks?


So I have a few large .rdata files that were generated with the R programming language. I have uploaded them to Azure Data Lake using Azure Storage Explorer, but I now need to convert these .rdata files to Parquet format and write them back into the data lake. How would I go about doing this? I can't seem to find any information about converting from .rdata to Parquet.


Solution

  • If you can use Python, there are libraries such as pyreadr that load .rdata files as pandas DataFrames. From there you can either convert to a PySpark DataFrame and write Parquet with Spark (shown below), or write Parquet directly with pandas (see the sketch after this example). Something like this:

    import pyreadr
    
    # read the .rdata file; pyreadr returns a dict of {object name: pandas DataFrame}
    result = pyreadr.read_r('input.rdata')
    
    print(result.keys())  # check the object name(s) stored in the file
    df = result["object"]  # extract the pandas DataFrame for that object name
    
    # convert to a Spark DataFrame and write it out as Parquet
    sdf = spark.createDataFrame(df)
    sdf.write.parquet("output")
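
  • Alternatively, if you do not need Spark at all, pandas can write Parquet on its own with to_parquet (which requires pyarrow or fastparquet). The sketch below is only an illustration and makes a few assumptions: the .rdata file has already been copied to a local path on the driver (for example with dbutils.fs.cp on Databricks), and "object" stands in for whatever name result.keys() reports:

    import pyreadr
    
    # pyreadr reads from a local file path, so copy the file out of the
    # data lake to the driver first
    result = pyreadr.read_r('input.rdata')
    
    # pick the R object you want as a pandas DataFrame
    df = result["object"]
    
    # write Parquet directly with pandas (requires pyarrow or fastparquet)
    df.to_parquet('output.parquet', index=False)

  • To land the converted files back in the data lake rather than on local storage, the Spark write above can also target an abfss:// URI once the cluster has access to the storage account, for example sdf.write.parquet("abfss://<container>@<account>.dfs.core.windows.net/path"), where the container and account names are placeholders for your own.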