java, data-structures, machine-learning, perceptron

Data structure confusion over implementation of a perceptron in Java


I'm trying to implement the perceptron algorithm in Java: just a single-layer perceptron, not a full neural network. It's a classification problem that I'm trying to solve.

What I need to do is create a bag-of-words feature vector for each document in one of four categories: politics, science, sports, and atheism. This is the data.

I'm trying to achieve this (a direct quote from the first answer to this question):

Example:

Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]

Dictionary is:

["I", "am", "awesome", "great"]

So the documents as a vector would look like:

Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]

And with that you can do all kinds of fancy math stuff and feed this into your perceptron.
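
To check my understanding, here's a tiny self-contained sketch of that worked example in Java (the dictionary order is fixed by hand, and the class name is just for illustration):

import java.util.Arrays;
import java.util.List;

public class BagOfWordsExample
{
    public static void main(String[] args)
    {
        //dictionary order fixed up front, as in the quoted example
        List<String> dictionary = Arrays.asList("I", "am", "awesome", "great");

        String[] document2 = { "I", "am", "great", "great" };

        //one counter per dictionary entry
        int[] vector = new int[dictionary.size()];
        for (String word : document2)
        {
            int index = dictionary.indexOf(word);
            if (index >= 0)
                vector[index]++;
        }

        System.out.println(Arrays.toString(vector)); //prints [1, 1, 0, 2]
    }
}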

I've been able to generate the global dictionary; now I need to make one for each document, but how can I keep them all straight? The folder structure is straightforward, i.e. `/politics/` has many articles inside, and for each one I need to make a feature vector against the global dictionary. I think the iterator I'm using is what's confusing me.
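
One idea I had for keeping them straight is a map from each file's path to its vector, so every feature vector stays attached to the document it came from (just a sketch; the names are hypothetical):

import java.io.File;
import java.util.HashMap;
import java.util.Map;

public class VectorIndex
{
    //maps e.g. "/politics/article1.txt" to that article's feature vector
    static Map<String, int[]> vectorsByFile = new HashMap<String, int[]>();

    static void store(File f, int[] featureVector)
    {
        vectorsByFile.put(f.getPath(), featureVector);
    }
}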

This is the main class:

import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class BagOfWords 
{
    static Set<String> global_dict = new HashSet<String>();

    static boolean global_dict_complete = false; 

    static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

    public static void main(String[] args) throws IOException 
    {
        //each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports" };

        //cycle through all categories once to populate the global dict
        for (int cycle = 0; cycle < categories.length; cycle++)
        {
            String general_data_partition = path + categories[cycle]; 

            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }   

        //after the global dict has been filled up,
        //cycle through again to build a set of
        //words for each document and compare it
        //to the global dict
        global_dict_complete = true; //must be true for every category on this pass

        for (int cycle = 0; cycle < categories.length; cycle++)
        {
            String general_data_partition = path + categories[cycle]; 

            File file = new File( general_data_partition );
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        //print the data struc              
        //for (String s : global_dict)
            //System.out.println( s );
    }
}

This iterates over the directory structure:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;

public class Iterateur 
{
    static void iterateDirectory(File file, 
                                 Set<String> global_dict, 
                                 boolean global_dict_complete) throws IOException 
    {
        for (File f : file.listFiles()) 
        {
            if (f.isDirectory()) 
            {
                //recurse into the subdirectory f
                iterateDirectory(f, global_dict, global_dict_complete);
            } 
            else 
            {
                BufferedReader br = new BufferedReader(new FileReader( f ));

                //the helpers consume the reader line by line themselves,
                //so hand it over once per file
                if (global_dict_complete == false)
                {
                    Dictionary.populate_dict(br, global_dict);
                }
                else
                {
                    FeatureVecteur.generateFeatureVecteur(f, br, global_dict);
                }

                br.close();
            }
        }
    }
}

This fills up that global dictionary:

import java.io.BufferedReader;
import java.io.IOException;
import java.util.Set;

public class Dictionary 
{
    public static void populate_dict(BufferedReader br, 
                                     Set<String> global_dict) throws IOException
    {
        String line;

        while ((line = br.readLine()) != null) 
        {
            String[] words = line.split(" ");//those are your words

            for (String word : words) 
            {
                //a Set ignores duplicates, so no contains() check is needed
                global_dict.add(word);
            }
        }
    }
}

This is an initial attempt at filling up the document-specific dictionaries:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class FeatureVecteur 
{
    public static void generateFeatureVecteur(File f, 
                                              BufferedReader br, 
                                              Set<String> global_dict) throws IOException
    {
        Set<String> file_dict = new HashSet<String>();

        String line;

        while ((line = br.readLine()) != null) 
        {
            String[] words = line.split(" ");//those are your words

            for (String word : words) 
            {
                file_dict.add(word);
            }
        }
    }
}

Solution

  • If I understand your question, you're trying to count how many instances of each word in the global dictionary occur in a given file. I'd recommend creating an array of integers, where the index represents the index into the global dictionary and the value represents the number of occurrences of that word in the file.

    Then, for each word in the global dictionary, count how many times that word occurs in the file. However, you need to be careful - feature vectors require a consistent ordering of the elements, and HashSets do not guarantee this. In your example, for instance, "I" always needs to be the first element. To solve this, you might want to convert your set to an ArrayList or some other sequential list once the global dictionary is totally finished.

    ArrayList<String> global_dict_list = new ArrayList<String>( global_dict );
    

    Counting could look something like this:

    int[] wordFrequency = new int[global_dict_list.size()];
    
    for ( int j = 0; j < global_dict_list.size(); j++ ) 
    {
        String globalWord = global_dict_list.get(j);

        for ( int i = 0; i < words.length; i++ ) 
        {
             if ( words[i].equals(globalWord) ) 
             {
                 //increment the dictionary index j, not the
                 //position i of the word within the line
                 wordFrequency[j]++;
             }
        }
    }
    

    Nest the counting loop inside the while loop that reads line by line in the feature vector code, and keep the wordFrequency declaration outside it so the counts accumulate across lines. Hope it helps!
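
    Putting those pieces together, a reworked generateFeatureVecteur might look roughly like this. It's only a sketch along the lines above, not tested against your data, and it assumes the ordered global_dict_list has already been built:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;

    public class FeatureVecteur 
    {
        public static int[] generateFeatureVecteur(File f, 
                                                   ArrayList<String> global_dict_list) throws IOException
        {
            //one counter per word in the (now ordered) global dictionary
            int[] wordFrequency = new int[global_dict_list.size()];

            BufferedReader br = new BufferedReader(new FileReader( f ));

            String line;
            while ((line = br.readLine()) != null) 
            {
                String[] words = line.split(" ");

                for (String word : words) 
                {
                    //words not in the dictionary are simply skipped
                    int index = global_dict_list.indexOf(word);
                    if (index >= 0)
                    {
                        wordFrequency[index]++;
                    }
                }
            }
            br.close();

            return wordFrequency;
        }
    }

    If the dictionary gets large, swapping the indexOf lookup for a HashMap<String, Integer> from word to index keeps the counting fast.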