Search code examples
javaapache-sparkapache-poispark-excel

Spark Excel library unable to read whole columns, only specific data address ranges


Java app here using the Spark Excel library to read an Excel file into a Dataset<Row>. When I use the following configurations:

String filePath = "file:///Users/myuser/example-data.xlsx";
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", "'ExampleData'!A2:D7")
        .load(filePath);

This works beautifully and my Dataset<Row> is instantiated without any issues whatsoever. But the minute I go to just tell it to read any rows between A through D, it reads an empty Dataset<Row>:

// dataset will be empty
.option("dataAddress", "'ExampleData'!A:D")

This also happens if I set the sheetName and dataAddress separately:

// dataset will be empty
.option("sheetName", "ExampleData")
.option("dataAddress", "A:D")

And it also happens when, instead of providing the sheetName, I provide a sheetIndex:

// dataset will be empty; and I have experimented by setting it to 0 as well
// in case it is a 0-based index
.option("sheetIndex", 1)
.option("dataAddress", "A:D")

My question: is this expected behavior of the Spark Excel library, or is it a bug I have discovered, or am I not using the Options API correctly here?


Solution

  • FWIW I rolled up my sleeves and it looks like this is a limitation in Apache POI. The spark-excel library uses POI under the hood to read Excel files, and specifically it uses a POI AreaReference when reading in cells for a data address. AreaReference cannot handle anything other than specific cell coordinates (e.g. "A2"), and it cannot handle whole column references (e.g. "A"):

    package com.example;
    
    import org.apache.poi.ss.usermodel.Cell;
    import org.apache.poi.ss.usermodel.Row;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.util.AreaReference;
    import org.apache.poi.xssf.usermodel.XSSFWorkbook;
    
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    
    public class PoiExcelDebugger {
    
        public static void main(String[] args) {
            String filePath = "/Users/<myuser>/path/to/price-sheet-v1-simple.xlsx";
            String sheetName = "Price Sheet";
    //        String dataAddress = "'Price Sheet'!A:D";
            String dataAddress = "'Price Sheet'!A2:D7";
    
            try (FileInputStream fis = new FileInputStream(new File(filePath));
                 Workbook workbook = new XSSFWorkbook(fis)) {
    
                Sheet sheet = workbook.getSheet(sheetName);
                if (sheet == null) {
                    System.err.println("Sheet '" + sheetName + "' not found!");
                    return;
                }
    
                AreaReference areaRef = new AreaReference(dataAddress,
                        workbook.getSpreadsheetVersion());
    
                int minColIndex = areaRef.getFirstCell().getCol();
                int maxColIndex = areaRef.getLastCell().getCol();
                int minRowIndex = areaRef.getFirstCell().getRow();
                int maxRowIndex = areaRef.getLastCell().getRow();
    
                System.out.println("Parsed Column Range: " + minColIndex + " to " + maxColIndex);
                System.out.println("Parsed Row Range: " + minRowIndex + " to " + maxRowIndex);
    
                for (Row row : sheet) {
                    int rowIndex = row.getRowNum();
    
                    // Skip empty rows outside the target range
                    if (rowIndex < minRowIndex || rowIndex > maxRowIndex) {
                        continue;
                    }
    
                    System.out.print("Row " + rowIndex + ": ");
    
                    // Read all columns within A:D (indices 0 to 3)
                    for (int colIndex = minColIndex; colIndex <= maxColIndex; colIndex++) {
                        Cell cell = row.getCell(colIndex, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                        System.out.print(getCellValueAsString(cell) + " | ");
                    }
    
                    System.out.println();
                }
    
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    
        private static String getCellValueAsString(Cell cell) {
            if (cell == null) return "NULL";
    
            return switch (cell.getCellType()) {
                case STRING -> cell.getStringCellValue();
                case NUMERIC -> String.valueOf(cell.getNumericCellValue());
                case BOOLEAN -> String.valueOf(cell.getBooleanCellValue());
                case FORMULA -> cell.getCellFormula();
                case BLANK -> "BLANK";
                default -> "UNKNOWN";
            };
        }
    }
    

    Above, everything will work perfectly fine. But if we replace the comments and change:

    // String dataAddress = "'Price Sheet'!A:D";
    String dataAddress = "'Price Sheet'!A2:D7";
    

    to:

    String dataAddress = "'Price Sheet'!A:D";
    // String dataAddress = "'Price Sheet'!A2:D7";
    

    It won't throw an exception but it won't read anything.

    Hence: I guess this is "expected" POI behavior, albeit, very disappointing.