I implemented a wordcount program with Java. Basically, the program takes a large file (in my tests, I used a 10 gb data file that contained numbers only), and counts the number of times each 'word' appears - in this case, a number (23723 for example might appear 243 times in the file).
Below is my implementation. I seek to improve it, with mainly performance in mind, but a few other things as well, and I am looking for some guidance. Here are a few of the issues I wish to correct:
Currently, the program is threaded and works properly. However, what I do is pass a chunk of memory (500MB/NUM_THREADS)
to each thread, and each thread proceeds to wordcount. The problem here is that I have the main thread wait for ALL the threads to complete before passing more data to each thread. It isn't too much of a problem, but there is a period of time where a few threads will wait and do nothing for a while. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance). Currently, I use the fact that I know the input is an integer, and just store the temporary variables as an int
, so no memory problems there. I want to be able to use some sort of delimiter, whether that delimiter be a space, or several characters.
I am using a global ConcurrentHashMap to story key value pairs. For example, if a thread finds a number "24624", it searches for that number in the map. If it exists, it will increase the value of that key by one. The value of the keys at the end represent the number of occurrences of that key. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
I am open to other possibilities as well, this is just what comes to mind.
Note: Splitting the file is not an option I want to explore, as I might be deploying this on a server in which I should not be creating my own files, but if it would really be a performance boost, I might listen.
Other Note: I am new to java threading, as well as new to StackOverflow. Be gentle.
public class BigCount2 {
public static void main(String[] args) throws IOException, InterruptedException {
int num, counter;
long i, j;
String delimiterString = " ";
ArrayList<Character> delim = new ArrayList<Character>();
for (char c : delimiterString.toCharArray()) {
delim.add(c);
}
int counter2 = 0;
num = Integer.parseInt(args[0]);
int bytesToRead = 1024 * 1024 * 1024 / 2; //500 MB, size of loop
int remainder = bytesToRead % num;
int k = 0;
bytesToRead = bytesToRead - remainder;
int byr = bytesToRead / num;
String filepath = "C:/Users/Daniel/Desktop/int-dataset-10g.dat";
RandomAccessFile file = new RandomAccessFile(filepath, "r");
Thread[] t = new Thread [num];//array of threads
ConcurrentMap<Integer, Integer> wordCountMap = new ConcurrentHashMap<Integer, Integer>(25000);
byte [] byteArray = new byte [byr]; //allocates 500mb to a 2D byte array
char[] newbyte;
for (i = 0; i < file.length(); i += bytesToRead) {
counter = 0;
for (j = 0; j < bytesToRead; j += byr) {
file.seek(i + j);
file.read(byteArray, 0, byr);
newbyte = new String(byteArray).toCharArray();
t[counter] = new Thread(
new BigCountThread2(counter,
newbyte,
delim,
wordCountMap));//giving each thread t[i] different file fileReader[i]
t[counter].start();
counter++;
newbyte = null;
}
for (k = 0; k < num; k++){
t[k].join(); //main thread continues after ALL threads have finished.
}
counter2++;
System.gc();
}
file.close();
System.exit(0);
}
}
class BigCountThread2 implements Runnable {
private final ConcurrentMap<Integer, Integer> wordCountMap;
char [] newbyte;
private ArrayList<Character> delim;
private int threadId; //use for later
BigCountThread2(int tid,
char[] newbyte,
ArrayList<Character> delim,
ConcurrentMap<Integer, Integer> wordCountMap) {
this.delim = delim;
threadId = tid;
this.wordCountMap = wordCountMap;
this.newbyte = newbyte;
}
public void run() {
int intCheck = 0;
int counter = 0; int i = 0; Integer check; int j =0; int temp = 0; int intbuilder = 0;
for (i = 0; i < newbyte.length; i++) {
intCheck = Character.getNumericValue(newbyte[i]);
if (newbyte[i] == ' ' || intCheck == -1) { //once a delimiter is found, the current tempArray needs to be added to the MAP
check = wordCountMap.putIfAbsent(intbuilder, 1);
if (check != null) { //if returns null, then it is the first instance
wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1);
}
intbuilder = 0;
}
else {
intbuilder = (intbuilder * 10) + intCheck;
counter++;
}
}
}
}
Some thoughts on a little of most ..
.. I believe some sort of worker pool or executor service could solve this problem (I have not learned the syntax for this yet).
If all the threads take about the same time to process the same amount of data, then there really isn't that much of a "problem" here.
However, one nice thing about a Thread Pool is it allows one to rather trivially adjust some basic parameters such as number of concurrent workers. Furthermore, using an executor service and Futures can provide an additional level of abstraction; in this case it could be especially handy if each thread returned a map as the result.
The program will only work for a file that contains integers. That's a problem. I struggled with this a lot, as I didn't know how to iterate through the data without creating loads of unused variables (using a String or even StringBuilder had awful performance) ..
This sounds like an implementation issue. While I would first try a StreamTokenizer (because it's already written), if doing it manually, I would check out the source - a good bit of that can be omitted when simplifying the notion of a "token". (It uses a temporary array to build the token.)
I am using a global ConcurrentHashMap to story key value pairs. .. So is this the proper design? Would I gain in performance by giving each thread it's own hashmap, and then merging them all at the end?
It would reduce locking and may increase performance to use a separate map per thread and merge strategy. Furthermore, the current implementation is broken as wordCountMap.put(intbuilder, wordCountMap.get(intbuilder) + 1)
is not atomic and thus the operation might under count. I would use a separate map simply because reducing mutable shared state makes a threaded program much easier to reason about.
Is there any other way of seeking through a file with an offset without using the class RandomAccessMemory? This class will only read into a byte array, which I then have to convert. I haven't timed this conversion, but maybe it could be faster to use something else.
Consider using a FileReader (and BufferedReader) per thread on the same file. This will avoid having to first copy the file into the array and slice it out for individual threads which, while the same amount of total reading, avoids having to soak up so much memory. The reading done is actually not random access, but merely sequential (with a "skip") starting from different offsets - each thread still works on a mutually exclusive range.
Also, the original code with the slicing is broken if an integer value was "cut" in half as each of the threads would read half the word. One work-about is have each thread skip the first word if it was a continuation from the previous block (i.e. scan one byte sooner) and then read-past the end of it's range as required to complete the last word.