java, defensive-programming, data-integrity

How else can this code be optimized for defensive programming?


For my data structures project, the goal is to read in a provided file containing over 10,000 songs, with artist, title, and lyrics clearly marked and each song separated by a line containing a single double quote. I've written this code to parse the text file, and it works, taking just under 3 seconds to:

  • read the 422K lines of text
  • create a Song object for each song
  • add each Song to an ArrayList

The parsing code I wrote is:

if (songSource.canRead()) {  //checks to see if file is valid to read
    readIn = new Scanner(songSource);
    while (readIn.hasNextLine()) {
        do {
            readToken = readIn.nextLine();

            if (readToken.startsWith("ARTIST=\"")) {
                artist = readToken.split("\"")[1];
            }
            if (readToken.startsWith("TITLE=\"")) {
                title = readToken.split("\"")[1];
            }
            if (readToken.startsWith("LYRICS=\"")) {
                lyrics = readToken.split("\"")[1];
            } else {
                lyrics += "\n" + readToken;
            }//end individual song if block
        } while (!readToken.startsWith("\"")); //end inner while loop

        songList.add(new Song(artist, title, lyrics));

    }//end while not EOF
} //end if file can be read

I was talking with my Intro to Algorithms professor about the code for this project, and he stated that I should try to be more defensive in my code to allow for inconsistencies in data provided by other people. Originally I was using if/else blocks between the Artist, Title and Lyrics fields, and on his suggestion I changed to sequential if statements. While I can see his point, using this code example, how can I be more defensive about allowing for input inconsistencies?


Solution

  • You are assuming that the input is perfect. Based on a quick read of your algorithm, the data you expect looks like this:

    ARTIST="John"
    TITLE="HELLO WORLD"
    LYRICS="Sing Song All night long"
    "
    

    But consider the case

    ARTIST="John"
    TITLE="HELLO WORLD"
    LYRICS="Sing Song All night long"
    "
    ARTIST="Peter"
    LYRICS="Sing Song All night long"
    "
    

    Based on your algorithm, you now have 2 songs characterized as

    songList = { Song("John", "HELLO WORLD", "Sing Song All night long"),
                 Song("Peter", "HELLO WORLD", "Sing Song All night long") }
    

    With the current algorithm, values from the previous song leak into the next one: here the title "HELLO WORLD" shows up in the 2nd song even though that record never defined a title. You need to reset your three variables after each song is added.

    In your else branch you append the complete line to lyrics. What if the lyrics have already been read? Any line that follows, including a mistyped field, silently ends up in them. Test case:

    ARTIST="John"
    LYRICS="Sing Song All night long"
    TILET="HELLO WORLD"
    "
    

    Consider sending such a record to an error state, so that when the batch read is completed, an error report can be generated and the bad records fixed.

    Also, you only check for EOF between songs. What if EOF occurs in the middle of a record, for example right after an ARTIST line, and the file does not end with a " line? readIn.nextLine() will then throw a NoSuchElementException. Add another hasNextLine() check to your do/while condition.
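
The three suggestions above (reset the fields for every record, send bad records to an error report, and check hasNextLine() inside the record loop) could be combined roughly as follows. This is only a sketch under assumptions: the names DefensiveSongParser, ParseResult and quoted are made up for illustration, Song is a minimal stand-in for the class in the question, and treating any KEY="..." line other than ARTIST, TITLE and LYRICS as an error is a heuristic that may not suit every lyrics file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// All names here (DefensiveSongParser, Song, ParseResult, quoted) are hypothetical.
class DefensiveSongParser {

    static class Song {
        final String artist, title, lyrics;
        Song(String artist, String title, String lyrics) {
            this.artist = artist;
            this.title = title;
            this.lyrics = lyrics;
        }
    }

    static class ParseResult {
        final List<Song> songs = new ArrayList<>();
        final List<String> errors = new ArrayList<>(); // one entry per rejected record
    }

    // Text between the first pair of double quotes; if the closing quote is
    // missing (a value continued on later lines), everything after the opening one.
    static String quoted(String line) {
        int start = line.indexOf('"') + 1;
        int end = line.indexOf('"', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    static ParseResult parse(String text) {
        return parse(new Scanner(text));
    }

    static ParseResult parse(Scanner in) {
        ParseResult result = new ParseResult();
        int lineNo = 0;
        while (in.hasNextLine()) {
            // Fresh state per record, so nothing leaks in from the previous song.
            String artist = null, title = null;
            StringBuilder lyrics = null;
            List<String> problems = new ArrayList<>();
            int firstLine = lineNo + 1;
            boolean terminated = false, sawContent = false;

            // hasNextLine() is checked again here, so a truncated file
            // cannot make nextLine() throw.
            while (!terminated && in.hasNextLine()) {
                String line = in.nextLine();
                String t = line.trim();
                lineNo++;
                if (t.equals("\"")) { terminated = true; continue; }
                if (t.isEmpty() && lyrics == null) continue; // blank line outside lyrics
                sawContent = true;
                if (t.startsWith("ARTIST=\"")) {
                    artist = quoted(t);
                } else if (t.startsWith("TITLE=\"")) {
                    title = quoted(t);
                } else if (t.startsWith("LYRICS=\"")) {
                    lyrics = new StringBuilder(quoted(t));
                } else if (t.matches("[A-Za-z]+=\".*")) {
                    // Assumption: looks like a field, but not one we know (e.g. TILET).
                    problems.add("unrecognized key at line " + lineNo);
                } else if (lyrics != null) {
                    lyrics.append('\n').append(line); // continuation of the lyrics
                } else {
                    problems.add("unexpected text at line " + lineNo);
                }
            }

            if (!terminated && !sawContent) break; // only trailing blank lines remained
            if (!terminated) problems.add("file ended before the closing quote line");
            if (artist == null) problems.add("missing ARTIST");
            if (title == null) problems.add("missing TITLE");
            if (lyrics == null) problems.add("missing LYRICS");

            if (problems.isEmpty()) {
                result.songs.add(new Song(artist, title, lyrics.toString()));
            } else {
                result.errors.add("record at line " + firstLine + ": " + problems);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ParseResult r = parse(
            "ARTIST=\"John\"\nLYRICS=\"Sing Song All night long\"\nTILET=\"HELLO WORLD\"\n\"\n");
        System.out.println(r.songs.size() + " songs, errors: " + r.errors);
    }
}
```

With this structure, the "Peter" record from the second example is rejected for its missing title instead of inheriting "HELLO WORLD", and the TILET record is reported rather than having the stray line appended to its lyrics. Whether a rejected record should be dropped, repaired, or halt the whole import is a design decision this sketch leaves open.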