java io machine-learning hashmap perceptron

Read in multiple files from a directory to create bag-of-words in Java



I have a training data set that consists of many files. I want to read all of them into a data structure (hash map?) to create a bag-of-words model for a particular class of documents, either science, religion, sports, or sex, in preparation for a perceptron implementation.

Right now I have the simplest of simple Java I/O constructs, i.e.

    String text; 
    BufferedReader br = new BufferedReader(new FileReader("file"));

    while ((text = br.readLine()) != null) 
    {
        //read in multiple files
        //generate a hash map with each unique word
        //as a key and the frequency with which that
        //word appears as the value
    }

So what I want to do is read input from multiple files in a directory and save all the data to one underlying structure. How can I do that? Should I write it out to a file somewhere?

I think a hash map, as described in the comments of the code above, would work, based on my understanding of bag-of-words. Is that right? How could I implement such a thing so that it stays in sync with reading input from multiple files? How should I store it so I can later incorporate it into my perceptron algorithm?

I've seen this done like so:

  String[] names = new String[]{"a.txt", "b.txt", "c.txt"};
  StringBuilder strContent = new StringBuilder();

  for (String name : names) {
      File file = new File(name);
      int ch;
      // try-with-resources closes the stream even if read() throws,
      // and avoids calling close() on a stream that failed to open
      try (FileInputStream stream = new FileInputStream(file)) {
          // reads one byte at a time; the cast to char assumes a single-byte encoding
          while ((ch = stream.read()) != -1) {
              strContent.append((char) ch);
          }
      }
  }

But this is a lame solution because you need to specify all the files in advance. I think it should be more dynamic, if possible.


Solution

  • You can try the program below. It's dynamic: you just need to provide your directory path.

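    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class BagOfWords {

        // Maps each file's path to the set of distinct words found in that file.
        // Keying by path is an assumption; the key was not specified originally.
        ConcurrentHashMap<String, Set<String>> map = new ConcurrentHashMap<String, Set<String>>();

        public static void main(String[] args) throws IOException {
            File file = new File("F:/Downloads/Build/");
            new BagOfWords().iterateDirectory(file);
        }

        private void iterateDirectory(File file) throws IOException {
            File[] children = file.listFiles(); // null if not a directory or on an I/O error
            if (children == null) {
                return;
            }
            for (File f : children) {
                if (f.isDirectory()) {
                    iterateDirectory(f); // recurse into the subdirectory, not the parent
                } else {
                    // Read the file (platform default charset)
                    String content = new String(Files.readAllBytes(f.toPath()));
                    // Split on whitespace and put the words in a set
                    Set<String> words = new HashSet<String>(Arrays.asList(content.trim().split("\\s+")));
                    // Add to the map
                    map.put(f.getPath(), words);
                }
            }
        }

    }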
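
  • The question asks for a map from each unique word to its frequency, rather than a set of words per file. A minimal sketch of that variant might look like the following. The class name `WordFrequencies`, the method `countWords`, the UTF-8 encoding, and the tokenization (lowercase, split on anything that is not a-z) are all assumptions made for illustration, so adjust them to your data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class WordFrequencies {

    // Walks the directory tree under root and counts how often each word
    // occurs across all regular files. Tokenization here is an assumption:
    // text is lowercased and split on any run of non a-z characters.
    static Map<String, Integer> countWords(Path root) throws IOException {
        Map<String, Integer> counts = new HashMap<>();

        // Collect the files first so the directory stream is closed promptly.
        List<Path> files;
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        for (Path p : files) {
            // Assumes the files are UTF-8 text.
            String text = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
                if (!token.isEmpty()) {
                    // merge() inserts 1 for a new word, otherwise adds 1 to the old count
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the directory as the first argument,
        // or fall back to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        Map<String, Integer> counts = countWords(root);
        System.out.println(counts.size() + " distinct words");
    }
}
```

    To build one bag-of-words per class (science, religion, and so on), one option is to call `countWords` once per class directory and keep the resulting maps in memory. Writing them to a file is only needed if you want to reuse them across runs.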