Java - Extracting non-duplicate words from PDF files


I wrote a simple Java program that uses PDFBox to extract the text from a PDF file. It reads the text from the PDF and splits it into lines.

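import java.io.File;
import java.io.IOException;

// Package names below assume PDFBox 2.x
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document when the block exits
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {

            if (!document.isEncrypted()) {

                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // Split the extracted text into lines (handles both \n and \r\n)
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }

            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }

}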

Is there a way to extract the words without duplicates?


Solution

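    1. Split each line on whitespace - line.trim().split("\\s+")
    2. Maintain a HashSet to hold these words and keep adding all the words to it.

    A HashSet stores each distinct value only once, so adding a word that is already present has no effect.

    // requires: import java.util.HashSet; import java.util.Set;
    Set<String> uniqueWords = new HashSet<>();

    for (String line : lines) {
      // Split on runs of whitespace so repeated spaces or tabs do not produce empty tokens
      String[] words = line.trim().split("\\s+");

      for (String word : words) {
        if (!word.isEmpty()) { // a blank line still yields one empty string
          uniqueWords.add(word);
        }
      }
    }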
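
Splitting on whitespace alone treats "Cat", "cat" and "cat." as three different words. If that matters for your use case, one option is to normalize each token before adding it to the set. The following is a minimal sketch, not part of PDFBox: the class name UniqueWords and the helper uniqueWords are hypothetical, and it assumes a word is any run of letters or digits, so "don't" would be split into "don" and "t". It works on a plain String, so you could pass it the result of tStripper.getText(document).

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

class UniqueWords {

    // Hypothetical helper: collects the distinct words of already-extracted text.
    // Assumption: a "word" is a run of letters or digits; case is ignored.
    static Set<String> uniqueWords(String text) {
        // LinkedHashSet drops duplicates and keeps first-seen order
        Set<String> words = new LinkedHashSet<>();
        // Split on any run of characters that is neither a letter nor a digit
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        String sample = "The cat sat.\nThe CAT ran, the dog sat!";
        System.out.println(uniqueWords(sample)); // prints [the, cat, sat, ran, dog]
    }
}
```

A LinkedHashSet is used here only so the words come out in the order they first appear; a plain HashSet works the same way if order does not matter, and a TreeSet would give them sorted.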