Removing stopwords from a String in Java


I have a string containing many words, and a text file containing stop words that I need to remove from the string. Say the string is:

s="I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should look like this:

"love phone, super fast much cool jelly bean....but recently bugs."

I have been able to achieve this, but whenever the string contains adjacent stop words, only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with jelly bean....but recently seen bugs"  

Here's my stopwordslist.txt file: Stopwords

How can I solve this problem? Here's what I have done so far:

int k = 0, i, j;
ArrayList<String> wordsList = new ArrayList<String>();
String sCurrentLine;
String[] stopwords = new String[2000];
try {
    FileReader fr = new FileReader("F:\\stopwordslist.txt");
    BufferedReader br = new BufferedReader(fr);
    while ((sCurrentLine = br.readLine()) != null) {
        stopwords[k] = sCurrentLine;
        k++;
    }
    String s = "I love this phone, its super fast and there's so much new and cool things with jelly bean....but of recently I've seen some bugs.";
    StringBuilder builder = new StringBuilder(s);
    String[] words = builder.toString().split("\\s");
    for (String word : words) {
        wordsList.add(word);
    }
    for (int ii = 0; ii < wordsList.size(); ii++) {
        for (int jj = 0; jj < k; jj++) {
            if (stopwords[jj].contains(wordsList.get(ii).toLowerCase())) {
                wordsList.remove(ii);
                break;
            }
        }
    }
    for (String str : wordsList) {
        System.out.print(str + " ");
    }
} catch (Exception ex) {
    System.out.println(ex);
}

Solution

  • The bug comes from removing elements from the list while you iterate over it by index. Say wordsList contains |word0|word1|word2|. If ii is 1 and the if test is true, you call wordsList.remove(1);. The list is then |word0|word2|, so word2 has moved to index 1. ii is then incremented to 2, which is no longer less than the size of the list (now 2), so the loop ends and word2 is never tested.

    From there, there are several solutions. For example, instead of removing values you can set them to "". Or you can build a separate "result" list.
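
The "result list" idea could look roughly like the sketch below. It is only an illustration under some assumptions: the class name StopwordFilter and the method removeStopwords are made up for this example, the stop words are hard-coded in a small hypothetical set instead of being read from stopwordslist.txt, and words are compared by exact (case-insensitive) match rather than with the contains call from the question, which is a substring test and may not be what was intended.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the original question or answer.
final class StopwordFilter {

    // Returns the words of text that are not stop words, joined by single spaces.
    // The stopwords set is expected to hold lower-case entries.
    static String removeStopwords(String text, Set<String> stopwords) {
        // Collect the words to keep in a new list, so nothing is removed
        // from a list while it is being iterated.
        List<String> kept = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && !stopwords.contains(word.toLowerCase())) {
                kept.add(word);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Assumed stop-word list for the demo; the real one would come from the file.
        Set<String> stopwords = new HashSet<>(Arrays.asList(
                "i", "this", "its", "and", "there's", "so", "new",
                "things", "with", "of", "i've", "seen", "some"));
        String s = "I love this phone, its super fast and there's so much new and cool things "
                + "with jelly bean....but of recently I've seen some bugs.";
        System.out.println(removeStopwords(s, stopwords));
    }
}
```

Because every word is visited exactly once and the input is never modified, adjacent stop words such as "and there's so" are all dropped. If you would rather keep removing from wordsList in place, decrementing ii right after wordsList.remove(ii), or calling wordsList.removeIf(...) (Java 8 and later), should avoid the skipped element as well.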