Search code examples
excelpysparkazure-databricks

Reading excel files in pyspark with 3rd row as header


I want to read read excel files as spark dataframe with 3rd row as a header.The synatax to read excel files as spark dataframe with 1st row as header is :

s_df  = spark.read.format("com.crealytics.spark.excel") \
                           .option("header", "true") \
                           .option("inferSchema", "true") \
                           .load(path + 'Sales.xlsx')

and the equivalent syntax to read as pandas dataframe with 3rd row as header is :

p_df = pd.read_excel(path + 'Sales.xlsx',header=3)

I want to do the same thing in pyspark that is to read excel files as spark dataframe with 3rd row as header.


Solution

  • Use the dataAddress option to specify the cell/row where the data is located . As you need to skip two rows, your data (including header) starts from row A3.

    s_df = spark.read.format("com.crealytics.spark.excel") \
               .option("header", "true") \
               .option("inferSchema","true") \
               .option("dataAddress", "'Sheet1'!A3") \
               .load("yourfilepath")
    

    Also, note that if your first two rows are empty, then dataAddress does not have to specified. The leading null rows will be skipped by default.

    Check the documentation here