Is bag-of-words the same thing as a document-term matrix?
I have a training data set that consists of many files. I want to read all of them into a data structure (hash map?) to create a bag-of-words model for a particular class of documents, either science, religion, sports, or sex, in preparation for a perceptron implementation.
Right now I have the simplest of Java I/O constructs, i.e.:
String text;
BufferedReader br = new BufferedReader(new FileReader("file"));
while ((text = br.readLine()) != null)
{
    // read in multiple files
    // generate a hash map with each unique word
    // as a key and the frequency with which that
    // word appears as the value
}
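To make the comments above concrete, here is a minimal sketch of the counting step for a single reader. The class name `WordCounter`, the method `countWords`, and the tokenization rule (lowercase, split on anything that is not a letter a-z) are assumptions made for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class WordCounter {
    // Adds the count of every word readable from br to the given map.
    public static void countWords(BufferedReader br, Map<String, Integer> counts) throws IOException {
        String line;
        while ((line = br.readLine()) != null) {
            // Assumption: lowercase the line and split on runs of non-letter characters.
            for (String word : line.toLowerCase().split("[^a-z]+")) {
                if (!word.isEmpty()) { // split can yield a leading empty string
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        // A StringReader stands in for a FileReader so the sketch runs without any files.
        try (BufferedReader br = new BufferedReader(new StringReader("the cat saw the dog"))) {
            countWords(br, counts);
        }
        System.out.println(counts);
    }
}
```

Because the map is passed in rather than created inside the method, the same map can be reused across several readers to accumulate counts from more than one file.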
What I want to do is read input from multiple files in a directory and save all the data to one underlying structure. How do I do that? Should I write it out to a file somewhere?
I think a hash map, as described in the comments of the code above, would work, based on my understanding of bag-of-words. Is that right? How could I implement such a thing so that it stays in sync with reading input from multiple files? And how should I store it so I can later incorporate it into my perceptron algorithm?
I've seen this done like so:
String[] names = new String[]{"a.txt", "b.txt", "c.txt"};
StringBuffer strContent = new StringBuffer("");
for (String name : names) {
    File file = new File(name);
    int ch;
    FileInputStream stream = null;
    try {
        stream = new FileInputStream(file);
        // read the file one byte at a time and append it to the buffer
        while ((ch = stream.read()) != -1) {
            strContent.append((char) ch);
        }
    } finally {
        // stream is still null if the constructor threw
        if (stream != null) {
            stream.close();
        }
    }
}
But this is a lame solution because you need to specify all the files in advance. I think it should be more dynamic, if possible.
You can try the program below. It's dynamic; you just need to provide your directory path.
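import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BagOfWords {
    ConcurrentHashMap<String, Set<String>> map = new ConcurrentHashMap<String, Set<String>>();

    public static void main(String[] args) throws IOException {
        File file = new File("F:/Downloads/Build/");
        new BagOfWords().iterateDirectory(file);
    }

    private void iterateDirectory(File file) throws IOException {
        File[] children = file.listFiles();
        if (children == null) {
            return; // not a directory, or it could not be read
        }
        for (File f : children) {
            if (f.isDirectory()) {
                iterateDirectory(f); // recurse into the subdirectory, not the parent
            } else {
                // Read File
                // Split and put it in a set
                // add to map
            }
        }
    }
}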
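One way the placeholder comments could be filled in is sketched below. It is not the program above: it swaps the recursive `File.listFiles()` traversal for `Files.walk`, and it stores a word-to-frequency map per file (as the question asks) instead of a `Set<String>`. The class name `DirectoryBagOfWords`, the method `countPerFile`, the UTF-8 encoding, and the tokenization rule are all assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DirectoryBagOfWords {
    // Returns one word-count map per regular file found under root, keyed by file path.
    public static Map<String, Map<String, Integer>> countPerFile(Path root) throws IOException {
        List<Path> files;
        // Files.walk visits subdirectories recursively; the stream must be closed.
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, Map<String, Integer>> bags = new HashMap<>();
        for (Path p : files) {
            Map<String, Integer> counts = new HashMap<>();
            // Assumption: every file is UTF-8 text; other encodings may throw here.
            for (String line : Files.readAllLines(p, StandardCharsets.UTF_8)) {
                // Assumption: lowercase, then split on runs of non-letter characters.
                for (String word : line.toLowerCase().split("[^a-z]+")) {
                    if (!word.isEmpty()) {
                        counts.merge(word, 1, Integer::sum);
                    }
                }
            }
            bags.put(p.toString(), counts);
        }
        return bags;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java DirectoryBagOfWords <directory>");
            return;
        }
        countPerFile(Paths.get(args[0])).forEach((file, counts) ->
                System.out.println(file + ": " + counts.size() + " distinct words"));
    }
}
```

Keeping one map per file means each entry is the bag of words for a single document. If every training class lives in its own directory, calling `countPerFile` once per class directory would give the per-class data a perceptron needs.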