Search code examples
javajava-streamapache-poi

Java Stream - Read POI Sheet in parallel using Stream


I'm reading a POI Sheet from input file, ranging from 80-400 entries with first cell containing integer codes:

| CODE |
| 112  |
| 112  |
| 113  |
| 114  |
...

I'm trying to count the number of occurrences of each CODE. For that I'm iterating lines one by one using iterator, using a HashMap to make a entry when iterator finds non-existant CODE or increment the counter if CODE exists in the HashMap:

var sheet=WorkbookFactory.create(new File("file.xlsx")).getSheetAt(0);
var codeMap=new HashMap<String, Integer>();
var iterator=sheet.iterator();
iterator.next();
while(iterator.hasNext()
{
    var cell=iterator.next().getCell(0);
    if(cell!=null)
    {
        var code=new DecimalFormat("00").format(cell.getNumericCellValue());
        codeMap.computeIfPresent(code,(key,val)->val+1);
        codeMap.putIfAbsent(code,1);
    }
}
codeMap.forEach((key,value)->System.out.println("Code: "+key+", count: "+value));

I'm trying to turn the above code into parallelStream to quicken up the process, I'm having a hard time understanding it. I'm sure this can be done, but I don't know how.

Thus far I only understand to create a stream:

var stream=Stream.of(sheet).parallel();
HashMap<String,Integer> codeMap=stream.filter();     //map? filter? stuck here

I tried reading this question, but didn't understand what's happening.


Solution

  • To read Excel worksheets using Apache POI via java.util.stream.Stream the Sheet.spliterator can be used to create a Stream<Row> using StreamSupport.

    Then Stream.map can be used to get a Stream<String> with all the contents of one column. For that the Function in Stream.map needs to get all contents of that column for each row as String. To get cell contents as String DataFormatter should be used to be independent of different cell types.

    Having that one can collect that Stream<String> using the groupig-by-collector having function identity and counting-collector. That gets a Map<String, Long> as the result.

    Complete code example:

    import org.apache.poi.ss.usermodel.*;
    
    import java.io.FileInputStream;
    import java.util.Map;
    import java.util.Spliterator;
    import java.util.stream.Stream;
    import java.util.stream.StreamSupport;
    import java.util.stream.Collectors;
    import java.util.function.Function;
    
    public class ReadExcelUsingStreamsCountOccurence {
    
     public static void main(String[] args) {
    
      DataFormatter dataFormatter = new DataFormatter();
      // from 5.2.0 on the DataFormatter can set to use cached values for formula cells, so no formula evaluation needed
      dataFormatter.setUseCachedValuesForFormulaCells(true);
      
      try (FileInputStream fileIn = new FileInputStream ("./Excel.xlsx");
           Workbook workbook = WorkbookFactory.create(fileIn); ) {
    
       Sheet sheet = workbook.getSheetAt(0);
    
       boolean parallel = false;
       //boolean parallel = true;
       Spliterator<Row> spliterator = sheet.spliterator();
       Stream<Row> stream = StreamSupport.stream(spliterator, parallel);
       Map<String, Long> codeMap = stream.skip(1)
        .map(row -> dataFormatter.formatCellValue(row.getCell(0)))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
       ;
      
       System.out.println(codeMap.getClass());
       System.out.println(codeMap);
    
      } catch(Exception ex) {
       ex.printStackTrace();
      }
    
     }
    }
    

    Hints:

    The stream.skip(1) is to skip the heading row.

    Using that code you can benchmark whether running that in parallel or not will have any influence. It should not have any significant influence.

    Why not simply always using parallel stream?

    Using a parallel stream is not always faster than using a sequential stream. Synchronization brings its own costs. So first criteria for the decision whether to use parallel stream or not is to benchmark. There is no way around.

    And running parallel may be error prone on the contrary as neither the Java Collections framework nor Apache POI are synchronized by default.

    In above example Stream.collect is thread-safe per default, even if the Java Collections framework, i.e. java.util.HashMap is not. And the Function of Stream.map only reads from one sheet of the workbook. That also should be thread-safe, even if Apache POI is not thread-safe per default.

    So above example should run without problems using boolean parallel = false; as well as boolean parallel = true;. But first criteria for the decision whether to use parallel stream or not is to benchmark. Why using a more error prone approach if it not has any advantages?