How can I optimize this code?

My current project has us using TreeSet and TreeMap in Java, with an input array of 10514 Song elements read in from a text file. Each Song contains a Artist, Title and Lyric fields. The aim of this project is to conduct fast searches on the lyrics using sets and maps.

First, I iterate over the input Song array, accessing the lyrics field and creating a Scanner object to iterate over the lyric words using this code: commonWords is a TreeSet of words that should not be keys, and lyricWords is the overall map of words to Songs.

public void buildSongMap() {
    for (Song song:songs) {
        //method variables
        String currentLyrics= song.getLyrics().toLowerCase(); 
        TreeSet<Song> addToSet=null;
        Scanner readIn= new Scanner(currentLyrics);
        String word= readIn.next();

        while (readIn.hasNext()) {

            if (!commonWords.contains(word) && !word.equals("") && word.length()>1) {
                if (lyricWords.containsKey(word)) {
                    addToSet= lyricWords.get(word);
                    addToSet.add(song);
                    word=readIn.next();
                } else 
                    buildSongSet(word);

            } else 
                word= readIn.next();
        }

    }

In order to build the songSet, I use this code:

public void buildSongSet(String word) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    for (Song song:songs) {
        //adds song to set 
        if (song.getLyrics().contains(word)) {
            songSet.add(song);
        }
    }
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

Now, since buildSongSet is called from inside a loop, creating the map executes in N^2 time. When the input array is 4 songs, searches run very fast, but when using the full array of 10514 elements, it can take over 15+ min to build the map on a 2.4GHz machine with 6 GiB RAM. What can I do to make this code more efficient? Unfortunately, reducing the input data is not an option.

Solution

It looks like your buildSongSet is doing redundant work. Your block:

if (lyricWords.containsKey(word)) {
    addToSet= lyricWords.get(word);
    addToSet.add(song);
    word=readIn.next();
}

adds a song to an existing set. So, when you find a word you don't know about, just add one song to it. Change buildSongSet to:

public void buildSongSet(String word, Song firstSongWithWord) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    songSet.add(firstSongWithWord);
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

the remaining songs left to be iterated will then be added to that songset from the first block of code if they contain that word. I think that should work.

EDIT just saw this was homework... so removed the HashSet recommendations..

Ok.. so suppose you have these Songs in order with lyrics:

Song 1 - foo
Song 2 - foo bar
Song 3 - foo bar baz

Song 1 will see that foo does not contain lyricWords, so it will call buildSongSet and create a set for foo. It will add itself into the set containing foo.

Song 2 will see that foo is in lyricWords, and add itself to the set. It will see bar is not in the set, and create a set and add itself. It doesn't need to traverse previous songs since the first time the word was seen was in Song 2.

Song 3 follows the same logic.

Another thing you can try doing to optimize your code is to figure out a way to not process duplicate words in the lyrics. if your lyrics are foo foo foo foo bar bar bar foo bar then you're going to be doing a lot of unnecessary checks.

EDIT also see rsp's answer - additional speedups there, but the big speedup is getting rid of the inner loop - glad it's down to 15 secs now.