I want to parse a text file and count its tokens. The file is read line by line, and every line is split into tokens. The tokens are put in a list and then processed by a method that counts them. The counts are stored in a ConcurrentHashMap with the token as the key and the count as the value. I also need to sort the result by highest word count.
But it looks like I'm missing something, because the counting gives me different results from run to run.
private ConcurrentHashMap<String, Integer> wordCount = new ConcurrentHashMap<>();
private ExecutorService executorService = Executors.newFixedThreadPool(4);

private void parseFile(String file) {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(new FileInputStream(file),
            StandardCharsets.ISO_8859_1))) {
        String line;
        ArrayList<String> tokenListForThread;
        while ((line = reader.readLine()) != null) {
            tokenListForThread = new ArrayList<>();
            StringTokenizer st = new StringTokenizer(line, " .,:!?", false);
            while (st.hasMoreTokens()) {
                tokenListForThread.add(st.nextToken());
            }
            startThreads(tokenListForThread);
        }
        reader.close();
        executorService.shutdown();
        executorService.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
    } catch (Exception e) {
        e.printStackTrace();
        System.exit(-1);
    }
    printWordCount();
}

private void startThreads(ArrayList<String> tokenList) {
    executorService.execute(() -> countWords(tokenList));
}

private void countWords(ArrayList<String> tokenList) {
    for (String token : tokenList) {
        int cnt = wordCount.containsKey(token) ? wordCount.get(token) : 0;
        wordCount.put(token, cnt + 1);
        /*if (wordCount.containsKey(token)){
            wordCount.put(token, wordCount.get(token)+ 1 );
        } else{
            wordCount.putIfAbsent(token, 1);
        }*/
    }
}

private void printWordCount() {
    ArrayList<Integer> results = new ArrayList<>();
    for (Map.Entry<String, Integer> entry : wordCount.entrySet()) {
        results.add(entry.getValue());
    }
    results.sort(Comparator.reverseOrder());
    for (int i = 0; i < 10; i++) {
        Integer tmp = results.get(i);
        System.out.println(tmp);
    }
}
Where is my mistake and if possible how can I fix it?
Incrementing a token's count should be atomic, but here it is not:
int cnt = wordCount.containsKey(token) ? wordCount.get(token) : 0;
wordCount.put(token, cnt + 1);
Two threads whose token lists contain the same token may both read the same cnt at the same time, each increment it, and then put the same value back, so one of the increments is lost. That is, the total count may be lower than the real one.
To fix it without changing the initial approach, you can use AtomicInteger as the type of the wordCount values (the map then becomes a ConcurrentHashMap<String, AtomicInteger>):
wordCount.putIfAbsent(token, new AtomicInteger());
wordCount.get(token).incrementAndGet();
Step 1: This handles the case where the token is not in the map yet. The token is put into the map with a count of zero. The putIfAbsent method is atomic, which saves you from concurrency issues.
Step 2: Get the reference to the AtomicInteger that corresponds to the given token and increment it. This operation is thread-safe as well.
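For reference, here is a minimal, self-contained sketch of the two steps above. The class name WordCountSketch, the helper methods, and the in-memory test input are made up for illustration and are not part of the original code. The second method shows a different technique that keeps plain Integer values: ConcurrentHashMap.merge, which performs the read-modify-write as a single atomic call.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; the names here are illustrative only.
public class WordCountSketch {

    // Variant 1: AtomicInteger values, one task per line as in the question.
    static Map<String, AtomicInteger> countWithAtomicInteger(List<List<String>> lines) {
        ConcurrentHashMap<String, AtomicInteger> wordCount = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (List<String> tokens : lines) {
            pool.execute(() -> {
                for (String token : tokens) {
                    // Step 1: atomically insert a zero counter if the token is new.
                    wordCount.putIfAbsent(token, new AtomicInteger());
                    // Step 2: atomically increment the counter stored in the map.
                    wordCount.get(token).incrementAndGet();
                }
            });
        }
        shutdownAndWait(pool);
        return wordCount;
    }

    // Variant 2 (a different technique): keep Integer values and let
    // ConcurrentHashMap.merge do the whole update atomically.
    static Map<String, Integer> countWithMerge(List<List<String>> lines) {
        ConcurrentHashMap<String, Integer> wordCount = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (List<String> tokens : lines) {
            pool.execute(() -> {
                for (String token : tokens) {
                    wordCount.merge(token, 1, Integer::sum);
                }
            });
        }
        shutdownAndWait(pool);
        return wordCount;
    }

    // Stops accepting tasks and blocks until the submitted ones have finished.
    private static void shutdownAndWait(ExecutorService pool) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("counting tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up input: 1000 "lines", each already split into tokens.
        List<List<String>> lines = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            lines.add(Arrays.asList("a", "b", "a"));
        }
        System.out.println(countWithAtomicInteger(lines));
        System.out.println(countWithMerge(lines));
    }
}
```

With AtomicInteger values, printWordCount would also have to read each count with entry.getValue().get() before sorting. The merge variant leaves the value type as Integer, so the printing code can stay as it is.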