I'm reading a POI Sheet from input file, ranging from 80-400 entries with first cell containing integer codes:
| CODE |
| 112 |
| 112 |
| 113 |
| 114 |
...
I'm trying to count the number of occurrences of each CODE
. For that I'm iterating lines one by one using iterator, using a HashMap
to make a entry when iterator finds non-existant CODE
or increment the counter if CODE
exists in the HashMap:
var sheet=WorkbookFactory.create(new File("file.xlsx")).getSheetAt(0);
var codeMap=new HashMap<String, Integer>();
var iterator=sheet.iterator();
iterator.next();
while(iterator.hasNext()
{
var cell=iterator.next().getCell(0);
if(cell!=null)
{
var code=new DecimalFormat("00").format(cell.getNumericCellValue());
codeMap.computeIfPresent(code,(key,val)->val+1);
codeMap.putIfAbsent(code,1);
}
}
codeMap.forEach((key,value)->System.out.println("Code: "+key+", count: "+value));
I'm trying to turn the above code into parallelStream
to quicken up the process, I'm having a hard time understanding it. I'm sure this can be done, but I don't know how.
Thus far I only understand to create a stream:
var stream=Stream.of(sheet).parallel();
HashMap<String,Integer> codeMap=stream.filter(); //map? filter? stuck here
I tried reading this question, but didn't understand what's happening.
To read Excel worksheets using Apache POI via java.util.stream.Stream
the Sheet.spliterator can be used to create a Stream<Row>
using StreamSupport
.
Then Stream.map
can be used to get a Stream<String>
with all the contents of one column. For that the Function
in Stream.map
needs to get all contents of that column for each row as String
. To get cell contents as String
DataFormatter should be used to be independent of different cell types.
Having that one can collect that Stream<String>
using the groupig-by-collector having function identity and counting-collector. That gets a Map<String, Long>
as the result.
Complete code example:
import org.apache.poi.ss.usermodel.*;
import java.io.FileInputStream;
import java.util.Map;
import java.util.Spliterator;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;
import java.util.stream.Collectors;
import java.util.function.Function;
public class ReadExcelUsingStreamsCountOccurence {
public static void main(String[] args) {
DataFormatter dataFormatter = new DataFormatter();
// from 5.2.0 on the DataFormatter can set to use cached values for formula cells, so no formula evaluation needed
dataFormatter.setUseCachedValuesForFormulaCells(true);
try (FileInputStream fileIn = new FileInputStream ("./Excel.xlsx");
Workbook workbook = WorkbookFactory.create(fileIn); ) {
Sheet sheet = workbook.getSheetAt(0);
boolean parallel = false;
//boolean parallel = true;
Spliterator<Row> spliterator = sheet.spliterator();
Stream<Row> stream = StreamSupport.stream(spliterator, parallel);
Map<String, Long> codeMap = stream.skip(1)
.map(row -> dataFormatter.formatCellValue(row.getCell(0)))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
;
System.out.println(codeMap.getClass());
System.out.println(codeMap);
} catch(Exception ex) {
ex.printStackTrace();
}
}
}
Hints:
The stream.skip(1)
is to skip the heading row.
Using that code you can benchmark whether running that in parallel or not will have any influence. It should not have any significant influence.
Why not simply always using parallel stream?
Using a parallel stream is not always faster than using a sequential stream. Synchronization brings its own costs. So first criteria for the decision whether to use parallel stream or not is to benchmark. There is no way around.
And running parallel may be error prone on the contrary as neither the Java Collections framework nor Apache POI are synchronized by default.
In above example Stream.collect is thread-safe per default, even if the Java Collections framework, i.e. java.util.HashMap
is not. And the Function
of Stream.map only reads from one sheet of the workbook. That also should be thread-safe, even if Apache POI is not thread-safe per default.
So above example should run without problems using boolean parallel = false;
as well as boolean parallel = true;
. But first criteria for the decision whether to use parallel stream or not is to benchmark. Why using a more error prone approach if it not has any advantages?