I wrote a simple Java program that uses PDFBox to extract words from a PDF file. It reads the text from the PDF and extracts it word by word.
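import java.io.File;
import java.io.IOException;
// Package names below assume PDFBox 2.x, where PDDocument.load(File) is available.
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {
    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document even if extraction fails
        try (PDDocument document = PDDocument.load(new File("C:\\my.pdf"))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // Split the extracted text into lines (handles both \r\n and \n)
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        } catch (IOException e) {
            System.err.println("Exception while trying to read pdf document - " + e);
        }
    }
}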
Is there a way to extract the words without duplicates?
Split each line on spaces with line.split(" ") and keep a HashSet to hold the words, adding every word to it. A HashSet by its nature ignores duplicates.
HashSet<String> uniqueWords = new HashSet<>();
for (String line : lines) {
    String[] words = line.split(" ");
    for (String word : words) {
        uniqueWords.add(word);
    }
}
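
Note that splitting on a single space leaves punctuation attached to words ("fox." and "fox" count as different entries), produces empty strings when there are consecutive spaces, and treats "The" and "the" as distinct. If that matters for your use case, a slightly more defensive variant could look like the sketch below. The class UniqueWords and its extract method are hypothetical names made up for this example and are not part of PDFBox; stripping punctuation and lower-casing are assumptions about what you want to count as the same word. It takes the whole extracted string, so you would pass it the result of tStripper.getText(document).

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class UniqueWords {

    // Hypothetical helper (not part of PDFBox): collects the distinct words of a text.
    static Set<String> extract(String text) {
        // LinkedHashSet drops duplicates like HashSet but keeps first-seen order.
        Set<String> unique = new LinkedHashSet<>();
        // Split on any run of whitespace, so tabs, line breaks and repeated spaces are handled.
        for (String token : text.split("\\s+")) {
            // Assumption: strip leading/trailing non-alphanumeric characters and ignore case.
            String word = token
                    .replaceAll("^[^\\p{L}\\p{N}]+|[^\\p{L}\\p{N}]+$", "")
                    .toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                unique.add(word);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by tStripper.getText(document).
        String sample = "The quick brown fox.\nThe lazy dog; the  fox!";
        System.out.println(extract(sample)); // prints [the, quick, brown, fox, lazy, dog]
    }
}
```

If the order of the words does not matter, a plain HashSet works just as well here; use a TreeSet instead if you want the words sorted alphabetically.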