java apache-spark apache-poi spark-excel

Spark Excel library unable to read whole columns, only specific data address ranges

Java app here using the Spark Excel library to read an Excel file into a Dataset<Row>. When I use the following configurations:

String filePath = "file:///Users/myuser/example-data.xlsx";
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", "'ExampleData'!A2:D7")
        .load(filePath);

This works beautifully and my Dataset<Row> is instantiated without any issues whatsoever. But the minute I go to just tell it to read any rows between A through D, it reads an empty Dataset<Row>:

// dataset will be empty
.option("dataAddress", "'ExampleData'!A:D")

This also happens if I set the sheetName and dataAddress separately:

// dataset will be empty
.option("sheetName", "ExampleData")
.option("dataAddress", "A:D")

And it also happens when, instead of providing the sheetName, I provide a sheetIndex:

// dataset will be empty; and I have experimented by setting it to 0 as well
// in case it is a 0-based index
.option("sheetIndex", 1)
.option("dataAddress", "A:D")

My question: is this expected behavior of the Spark Excel library, or is it a bug I have discovered, or am I not using the Options API correctly here?

Solution

FWIW I rolled up my sleeves and it looks like this is a limitation in Apache POI. The spark-excel library uses POI under the hood to read Excel files, and specifically it uses a POI AreaReference when reading in cells for a data address. AreaReference cannot handle anything other than specific cell coordinates (e.g. "A2"), and it cannot handle whole column references (e.g. "A"):

package com.example;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.util.AreaReference;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

public class PoiExcelDebugger {

    public static void main(String[] args) {
        String filePath = "/Users/<myuser>/path/to/price-sheet-v1-simple.xlsx";
        String sheetName = "Price Sheet";
//        String dataAddress = "'Price Sheet'!A:D";
        String dataAddress = "'Price Sheet'!A2:D7";

        try (FileInputStream fis = new FileInputStream(new File(filePath));
             Workbook workbook = new XSSFWorkbook(fis)) {

            Sheet sheet = workbook.getSheet(sheetName);
            if (sheet == null) {
                System.err.println("Sheet '" + sheetName + "' not found!");
                return;
            }

            AreaReference areaRef = new AreaReference(dataAddress,
                    workbook.getSpreadsheetVersion());

            int minColIndex = areaRef.getFirstCell().getCol();
            int maxColIndex = areaRef.getLastCell().getCol();
            int minRowIndex = areaRef.getFirstCell().getRow();
            int maxRowIndex = areaRef.getLastCell().getRow();

            System.out.println("Parsed Column Range: " + minColIndex + " to " + maxColIndex);
            System.out.println("Parsed Row Range: " + minRowIndex + " to " + maxRowIndex);

            for (Row row : sheet) {
                int rowIndex = row.getRowNum();

                // Skip empty rows outside the target range
                if (rowIndex < minRowIndex || rowIndex > maxRowIndex) {
                    continue;
                }

                System.out.print("Row " + rowIndex + ": ");

                // Read all columns within A:D (indices 0 to 3)
                for (int colIndex = minColIndex; colIndex <= maxColIndex; colIndex++) {
                    Cell cell = row.getCell(colIndex, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                    System.out.print(getCellValueAsString(cell) + " | ");
                }

                System.out.println();
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private static String getCellValueAsString(Cell cell) {
        if (cell == null) return "NULL";

        return switch (cell.getCellType()) {
            case STRING -> cell.getStringCellValue();
            case NUMERIC -> String.valueOf(cell.getNumericCellValue());
            case BOOLEAN -> String.valueOf(cell.getBooleanCellValue());
            case FORMULA -> cell.getCellFormula();
            case BLANK -> "BLANK";
            default -> "UNKNOWN";
        };
    }
}

Above, everything will work perfectly fine. But if we replace the comments and change:

// String dataAddress = "'Price Sheet'!A:D";
String dataAddress = "'Price Sheet'!A2:D7";

to:

String dataAddress = "'Price Sheet'!A:D";
// String dataAddress = "'Price Sheet'!A2:D7";

It won't throw an exception but it won't read anything.

Hence: I guess this is "expected" POI behavior, albeit, very disappointing.