Removing words from a text file in Java


I am having some problems with this Java task.

I have two files: hello.txt and stopwords.txt. I want to remove from hello.txt every word that appears in stopwords.txt, and then print the frequencies of the top n words in the updated hello.txt to the console.

I know how to do this in Python, but not in Java. I believe a hash map would be the best approach for this.

Thank you very much!

I have attempted to use this code, but I am not getting any output:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.*;

public class practice {

    public static void main(String[] args) throws IOException {
        ArrayList stopword = new ArrayList<>();
        try {
            FileInputStream fis = new FileInputStream("stopwords.txt");
            byte b[] = new byte[fis.available()];
            fis.read(b);
            fis.close();
            String data[] = new String(b).trim().split("\n");
            for (int i = 0; i < data.length; i++) {
                stopword.add(data[i].trim());
            }
            FileInputStream fis2 = new FileInputStream("hello.txt");
            byte b1[] = new byte[fis2.available()];
            fis2.read(b);
            fis2.close();
            String data1[] = new String(b1).trim().split("\n");
//                  String myFile="";
            for(int i = 0; i < data1.length; i++) {
                String myFile = "";
                String s2[] = data[i].split("/s");
                for (int j = 0; j < s2.length; j++) {
                    if (!(stopword.contains(s2[j].trim().toLowerCase()))) {
                        myFile = myFile+s2[j]+" ";
                    }
                }
                System.out.println(myFile+"\n");
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        File file = new File("hello.txt");
        try (Scanner sc = new Scanner(new FileInputStream(file))) {
            int count=0;
            while(sc.hasNext()){
                sc.next();
                count++;
            }
            System.out.println("Number of words for new file: " + count);
        }
    }
}

Solution

  • Given a file hello.txt containing:

        remove leave remove leave remove leave re move remov e leave remove hello remove world!

    and a file stopwords.txt containing:

        remove world

    Using the Files class, I can read the entire contents of each file into a string. Then I can use replaceAll() from the String class to remove each stop word from the text. My example doesn't save the new string back to the file, but this can be done by adding the following lines:

    byte[] strToBytes = helloTxt.getBytes();
    Files.write(Paths.get("hello.txt"), strToBytes);
    

    The code to read the file and replace all found stop words:

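    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Pattern;

    public class RemoveWords {
        public static void main (String[] args) {
            try {
                // per @markspace's comment (Files.readString requires Java 11 or later)
                String helloTxt = Files.readString(Paths.get("hello.txt"), Charset.defaultCharset());
                String stopWordsTxt = Files.readString(Paths.get("stopwords.txt"), Charset.defaultCharset());

                // Split on runs of whitespace so blank lines do not produce empty entries
                String[] stopWords = stopWordsTxt.trim().split("\\s+");

                for (String stopWord : stopWords) {
                    // replaceAll takes a regex: quote the word so it is matched literally,
                    // and anchor it at word boundaries so only whole words are removed
                    helloTxt = helloTxt.replaceAll("\\b" + Pattern.quote(stopWord) + "\\b", "");
                }

                System.out.println(helloTxt);

            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }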
    

    Outputs

     leave  leave  leave re move remov e leave  hello  !
    

    To calculate the frequency of words, you may want to check out this solution I came up with for another use case.
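
Since the question also asks for the top n word frequencies, here is a minimal sketch of one way to do that with a HashMap. It is not the linked solution: the class name WordFrequency and the helper methods countWords and topN are hypothetical names chosen for illustration, and the sample text is inlined instead of being read from hello.txt and stopwords.txt. It also assumes that words should be compared case-insensitively and that punctuation is not part of a word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class WordFrequency {

    // Counts every word in text that is not a stop word (case-insensitive).
    public static Map<String, Integer> countWords(String text, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on anything that is not a letter or digit, so "world!" yields "world".
        for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty() || stopWords.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    // Returns up to n entries, highest count first; ties are ordered alphabetically.
    public static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // Sample data from the answer above, inlined for the sketch.
        String helloTxt = "remove leave remove leave remove leave re move remov e leave remove hello remove world!";
        Set<String> stopWords = new HashSet<>(Arrays.asList("remove", "world"));

        Map<String, Integer> counts = countWords(helloTxt, stopWords);
        for (Map.Entry<String, Integer> entry : topN(counts, 3)) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

To use the real files, the two strings could be loaded with Files.readString as in the code above, and the stop words collected into a HashSet so that each lookup does not scan a list.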