Search code examples
javaoptimizationbuild-time

How can I optimize this code?


My current project has us using TreeSet and TreeMap in Java, with an input array of 10514 Song elements read in from a text file. Each Song contains a Artist, Title and Lyric fields. The aim of this project is to conduct fast searches on the lyrics using sets and maps.

First, I iterate over the input Song array, accessing the lyrics field and creating a Scanner object to iterate over the lyric words using this code: commonWords is a TreeSet of words that should not be keys, and lyricWords is the overall map of words to Songs.

public void buildSongMap() {
    for (Song song:songs) {
        //method variables
        String currentLyrics= song.getLyrics().toLowerCase(); 
        TreeSet<Song> addToSet=null;
        Scanner readIn= new Scanner(currentLyrics);
        String word= readIn.next();

        while (readIn.hasNext()) {

            if (!commonWords.contains(word) && !word.equals("") && word.length()>1) {
                if (lyricWords.containsKey(word)) {
                    addToSet= lyricWords.get(word);
                    addToSet.add(song);
                    word=readIn.next();
                } else 
                    buildSongSet(word);

            } else 
                word= readIn.next();
        }

    }

In order to build the songSet, I use this code:

public void buildSongSet(String word) {     
    TreeSet<Song> songSet= new TreeSet<Song>();
    for (Song song:songs) {
        //adds song to set 
        if (song.getLyrics().contains(word)) {
            songSet.add(song);
        }
    }
    lyricWords.put(word, songSet);
    System.out.println("Word added "+word);
}

Now, since buildSongSet is called from inside a loop, creating the map executes in N^2 time. When the input array is 4 songs, searches run very fast, but when using the full array of 10514 elements, it can take over 15+ min to build the map on a 2.4GHz machine with 6 GiB RAM. What can I do to make this code more efficient? Unfortunately, reducing the input data is not an option.


Solution

  • It looks like your buildSongSet is doing redundant work. Your block:

    if (lyricWords.containsKey(word)) {
        addToSet= lyricWords.get(word);
        addToSet.add(song);
        word=readIn.next();
    } 
    

    adds a song to an existing set. So, when you find a word you don't know about, just add one song to it. Change buildSongSet to:

    public void buildSongSet(String word, Song firstSongWithWord) {     
        TreeSet<Song> songSet= new TreeSet<Song>();
        songSet.add(firstSongWithWord);
        lyricWords.put(word, songSet);
        System.out.println("Word added "+word);
    }
    

    the remaining songs left to be iterated will then be added to that songset from the first block of code if they contain that word. I think that should work.

    EDIT just saw this was homework... so removed the HashSet recommendations..

    Ok.. so suppose you have these Songs in order with lyrics:

    • Song 1 - foo
    • Song 2 - foo bar
    • Song 3 - foo bar baz

    Song 1 will see that foo does not contain lyricWords, so it will call buildSongSet and create a set for foo. It will add itself into the set containing foo.

    Song 2 will see that foo is in lyricWords, and add itself to the set. It will see bar is not in the set, and create a set and add itself. It doesn't need to traverse previous songs since the first time the word was seen was in Song 2.

    Song 3 follows the same logic.

    Another thing you can try doing to optimize your code is to figure out a way to not process duplicate words in the lyrics. if your lyrics are foo foo foo foo bar bar bar foo bar then you're going to be doing a lot of unnecessary checks.

    EDIT also see rsp's answer - additional speedups there, but the big speedup is getting rid of the inner loop - glad it's down to 15 secs now.