Tags: apache-spark, parquet

Most efficient way to get a single sample row from parquet files


I need to be able to get a single sample row from a number of datasets stored in Parquet format. I don't know how the Parquet files were generated (i.e. what key was used to optimise storage). This operation needs to be as quick and efficient as possible.

The issue is that many of these datasets are terabytes in size and are split into thousands of Parquet files. If possible I need to avoid reading the entire dataset into memory just to get a single row. I am aware of Spark methods like limit and take, but these involve creating a Spark DataFrame/Dataset over the whole dataset and then limiting the returned value. So my question is: is there an efficient way to perform this task, or does Parquet's column-optimised format make it inherently costly?


Solution

  • Use:

    val singleRow: Row = myDataFrame.head()
    

    It will return the first row of your data without loading anything more into memory. If your input data is a folder containing multiple files, it will return the first row of the first file when the files are sorted alphabetically.

    Note: there is also the Dataset.first method, which is simply an alias:

    def first(): T = head()
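
    For completeness, here is a minimal end-to-end sketch, assuming you have a SparkSession available; the dataset path /data/my_dataset and the app name are hypothetical placeholders:

        import org.apache.spark.sql.{Row, SparkSession}

        val spark = SparkSession.builder()
          .appName("single-row-sample")  // hypothetical app name
          .getOrCreate()

        // Reading the directory is lazy: at this point Spark only lists
        // files and reads Parquet footer metadata, no row data.
        val df = spark.read.parquet("/data/my_dataset")  // hypothetical path

        // head() runs a collect-limit job rather than a full collect, so
        // it typically touches only a single Parquet row group even for
        // a terabyte-scale dataset split across thousands of files.
        val singleRow: Row = df.head()
        println(singleRow)

    The reason this is cheap is that head() does not materialise the whole dataset: the limit of one row flows into the physical plan, and Spark's take implementation starts by scanning a single input partition, scaling up to more partitions only if no rows are found there.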