
Efficient way of crawling the file system using threads in Java


I am currently working on a Java project that runs OCR on PDFs from the file system so that their content can be searched.

The project searches a folder that the user specifies. For each PDF, I extract the content with OCR and check whether it contains the keywords provided by the user.

I want the crawling (traversal) to continue while OCR is running on a PDF, necessarily on another thread or a few threads, so that the performance of the system is not reduced dramatically.

Is there a way to accomplish this? I've included the traversal code I am using below:

public void traverseDirectory(File[] files) {
    // listFiles() returns null when a directory cannot be read
    if (files == null) {
        return;
    }
    for (File file : files) {
        if (file.isDirectory()) {
            // Depth-first: descend before moving on to the next sibling
            traverseDirectory(file.listFiles());
        } else if (file.getName().endsWith(".pdf")) {
            //checking content goes here
        }
    }
}

Solution

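  • You can use Files.walkFileTree to traverse on the calling thread and submit the OCR for each file to a thread pool, so the walk does not wait for OCR to finish: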

    ExecutorService executor = Executors.newFixedThreadPool(threadCount);
    PdfOcrService service = ...
    Path rootPath = Paths.get("/path/to/your/directory");
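    Files.walkFileTree(rootPath, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path path, BasicFileAttributes attrs) {
            // Hand the OCR work to the pool; the walk continues immediately
            executor.submit(() -> {
                service.performOcrOnFile(path);
            });
            // visitFile must return a FileVisitResult to keep walking
            return FileVisitResult.CONTINUE;
        }
    });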
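
    For a fuller picture, here is a minimal, self-contained sketch of the same idea that also filters for PDFs, collects the results, and shuts the pool down. The class name PdfCrawler, the method findMatches, and the matcher parameter are hypothetical names introduced for illustration. The matcher stands in for "run OCR on this file and check the keywords", since the actual OCR service is not shown here. Treating the extension case-insensitively is also an assumption.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

// Hypothetical helper class, not part of any library.
final class PdfCrawler {

    /**
     * Walks root on the calling thread and evaluates matcher for every PDF
     * on a fixed-size pool. matcher is a placeholder for the OCR + keyword check.
     * Returns the PDFs for which matcher returned true, in visit order.
     */
    static List<Path> findMatches(Path root, int threadCount, Predicate<Path> matcher)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        List<Path> candidates = new ArrayList<>();
        List<Future<Boolean>> results = new ArrayList<>();
        try {
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
                    if (name.endsWith(".pdf")) {
                        // Submit and move on; the walk never waits for the check
                        Callable<Boolean> task = () -> matcher.test(file);
                        candidates.add(file);
                        results.add(executor.submit(task));
                    }
                    return FileVisitResult.CONTINUE;
                }

                @Override
                public FileVisitResult visitFileFailed(Path file, IOException exc) {
                    // Skip unreadable entries instead of aborting the whole walk
                    return FileVisitResult.CONTINUE;
                }
            });

            // Only after the walk finishes do we block on the individual results
            List<Path> matches = new ArrayList<>();
            for (int i = 0; i < results.size(); i++) {
                if (results.get(i).get()) {
                    matches.add(candidates.get(i));
                }
            }
            return matches;
        } finally {
            // Let the worker threads exit once all submitted tasks are done
            executor.shutdown();
        }
    }
}
```

    You would call it with something like PdfCrawler.findMatches(rootPath, threadCount, path -> yourOcrCheck(path)), where yourOcrCheck is whatever performs the OCR and keyword test. The traversal itself stays single-threaded because listing directories is usually cheap compared to OCR. The pool size then controls how many OCR jobs run at once.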