Tags: python-3.x, excel, pyspark, azure-databricks

How to read an Excel file (.xlsx) using PySpark and store it in a DataFrame?


I have data in an Excel file (.xlsx). How can I read this data and store it in a Spark DataFrame?


Solution

  • On your Databricks cluster, install the following two libraries:

    Clusters -> select your cluster -> Libraries -> Install New -> Maven -> in Coordinates: com.crealytics:spark-excel_2.12:0.13.5

    Clusters -> select your cluster -> Libraries -> Install New -> PyPI -> in Package: xlrd

    Then you will be able to read your Excel file as follows:

    # filePath: path to the .xlsx file you want to read
    sparkDF = spark.read.format("com.crealytics.spark.excel") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .option("dataAddress", "'NameOfYourExcelSheet'!A1") \
        .load(filePath)
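
If you read several sheets or files, the same options can be built in one place and passed with `.options(**...)`. The following is only a sketch: `build_excel_options` and `read_excel` are hypothetical helper names (not part of spark-excel or PySpark), and it assumes the sheet name contains no apostrophes and that the spark-excel library is installed on the cluster as described above.

```python
def build_excel_options(sheet_name, start_cell="A1", header=True, infer_schema=True):
    """Build the reader options used above as a dict of strings.

    Hypothetical helper: the option names (header, inferSchema, dataAddress)
    are the ones used in the answer; the function itself is illustrative.
    """
    return {
        "header": str(header).lower(),             # "true" / "false"
        "inferSchema": str(infer_schema).lower(),  # "true" / "false"
        # Sheet name in single quotes, then "!" and the top-left cell
        "dataAddress": f"'{sheet_name}'!{start_cell}",
    }


def read_excel(spark, file_path, sheet_name, **kwargs):
    """Read one sheet into a Spark DataFrame (needs spark-excel on the cluster)."""
    options = build_excel_options(sheet_name, **kwargs)
    return (
        spark.read.format("com.crealytics.spark.excel")
        .options(**options)
        .load(file_path)
    )
```

With these helpers, the read above would be `sparkDF = read_excel(spark, filePath, "NameOfYourExcelSheet")`. Wrapping the chained calls in parentheses also avoids the need for trailing backslashes.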