java, string, optimization, words

Java: word statistics for a long string


I'm writing a program in Java to get statistics on the words in a very big string (string length <= 100000). It should take less than 1 second and use less than 16 MB of memory.

import java.util.Scanner;

class Main {
    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String t = sc.nextLine();
        int i = 0;
        while (t.charAt(i) == ' ') i++;
        t = t.substring(i);
        String[] s = t.split(" +");

        RecString[] stat = new RecString[s.length];
        for (i = 0; i < s.length; i++) {
            stat[i] = new RecString("");
        }
        int j = 0;
        for (i = 0; i < s.length; i++) {
            int f = 0;
            for (int h = 0; h < stat.length; h++) {
                if (stat[h].word.equals(s[i])) {
                    f = 1;
                    stat[h].count++;
                    break;
                }
            }
            if (f == 0) {
                stat[j] = new RecString(s[i]);
                j++;
            }
        }
        for (i = 0; i <= j; i++) {
            if (stat[i].word != "") {
                System.out.println(stat[i].word + " " + (stat[i].count));
            }
        }
    }
}

class RecString{
    public  String word;
    public  int count;

    public RecString(String s){
        word = s;
        count = 1;
    }

}

This code works on strings of length <= 255, but for big strings it exceeds the time and/or memory limit.

Please help me optimize my program.


Solution

  • If you're concerned about memory, try to stream the input as much as possible instead of reading it all into one string.

    See http://docs.oracle.com/javase/6/docs/api/java/io/StreamTokenizer.html

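    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.StreamTokenizer;

    // Place this inside a method declared with "throws IOException".
    StreamTokenizer tokenizer = new StreamTokenizer(new InputStreamReader(System.in));

    // Read one token at a time rather than holding the whole input in memory.
    while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
        if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
            // Found a word; sval holds its text.
            System.out.println(tokenizer.sval);
        }
    }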
    

    Of course, if memory were not a problem and speed were your only concern, Hadoop has an excellent word counting example: http://wiki.apache.org/hadoop/WordCount . But save that for a rainy day of learning.

    Also, your word-counting logic is inefficient: every word triggers a linear scan of the array, which is O(N) per word and O(N^2) overall. @DaveNewton is right that you should probably use a Map<String,Integer> instead of your array of RecString; with a HashMap each lookup is O(1) on average. I'm not going to correct your code on that, as I think it's a good exercise.
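
For readers who have tried the exercise and want something to compare against, here is a minimal sketch that combines the two suggestions above: tokens are streamed with StreamTokenizer and counted in a Map<String,Integer>. The class name WordStats and the method countWords are hypothetical, not from the question. The sketch assumes a "word" is any run of non-whitespace characters and that all of standard input should be read (the question's code reads a single line and splits on spaces only). It has not been measured against the 1 second / 16 MB limits.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class name; not part of the original post.
class WordStats {

    // Counts whitespace-separated words read from the given reader.
    static Map<String, Integer> countWords(Reader in) {
        StreamTokenizer tokenizer = new StreamTokenizer(in);
        // Assumption: a word is any run of non-whitespace characters.
        // The default syntax would parse numbers and treat '/' and quote
        // characters specially, so it is reset here.
        tokenizer.resetSyntax();
        tokenizer.whitespaceChars(0, ' ');
        tokenizer.wordChars('!', 255);

        // LinkedHashMap keeps words in first-seen order, like the
        // output of the question's program.
        Map<String, Integer> counts = new LinkedHashMap<>();
        try {
            while (tokenizer.nextToken() != StreamTokenizer.TT_EOF) {
                if (tokenizer.ttype == StreamTokenizer.TT_WORD) {
                    // Average O(1) per word instead of an array scan.
                    counts.merge(tokenizer.sval, 1, Integer::sum);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    public static void main(String[] args) {
        Reader in = new BufferedReader(new InputStreamReader(System.in));
        // Build the output once to avoid many small writes.
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Integer> e : countWords(in).entrySet()) {
            out.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        System.out.print(out);
    }
}
```

If the output order does not matter, a plain HashMap would also work; LinkedHashMap is used here only to mimic the order in which the original program prints words.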