Search code examples
java

Collector (Java Streams) using StringBuilder hangs


I'm trying to implement grep in Java with Stream as part of a job training program. The code runs fine in Intellij but when I package it with mvn clean package it just hangs while calling collect (I'm omitting the getters/setters and some other methods)

class JavaGrep {
    public void process() throws IOException {
        Stream<Path> paths = listFiles(this.rootPath);
        logger.debug(this.outFile);
        paths.map(
            Path::toFile
        ).forEach(
            this::readLargeFile
        );
    }

    public Stream<Path> listFiles(String rootDir) {
        try {
            return Files.walk(Paths.get(rootDir)).filter(
                path -> !Files.isDirectory(path)
            );
        } catch (IOException e) {
            throw new RuntimeException("Error: IO Exception while reading root directory", e);
        }
    }

    void readLargeFile(File file)  {
        try(FileInputStream fis = new FileInputStream(file);) {
            byte[] buffer = new byte[CHUNK_SIZE]; // Buffer to hold 20MB chunks
            int counter = 1;
            long fileSize = Files.size(file.toPath());
            while (fis.read(buffer)!= -1) {
                logger.debug("Progress reading {} : {}", file.toPath(), Math.min((float) counter * (float)CHUNK_SIZE / (float)fileSize * 100.0, 100.0));
                counter++;
                String s = new String(buffer, StandardCharsets.ISO_8859_1);
                Stream<String> lines = Arrays.stream(s.split("\n"));
                String chunkContent = lines.filter(
                    this::containsPattern
                ).collect(
                    new StringBuilderCollector()
                );
                logger.debug("writing to file");
                this.writeToFile(chunkContent);
            }
        } catch (IOException e) {
            throw new RuntimeException("Error while reading file " + file.getPath(), e);
        }
    }

    public void writeToFile(String lines) {
        Path outFilePath = Paths.get(this.outFile);
        try(
            FileChannel fileChannel = FileChannel.open(outFilePath, StandardOpenOption.APPEND, StandardOpenOption.CREATE);
            FileLock lock = fileChannel.lock()
        ) {
            Files.write(outFilePath, lines.getBytes(StandardCharsets.ISO_8859_1), StandardOpenOption.APPEND);
        } catch (IOException e) {
            logger.error("error while opening atomic file channel for {}", outFilePath, e);
        }
    }
}

And the Collector class that I implemented:

public class StringBuilderCollector implements Collector<String,StringBuilder,String> {

    StringBuilderCollector() {
    }

    @Override
    public Supplier<StringBuilder> supplier() {
        return StringBuilder::new;
    }

    @Override
    public BiConsumer<StringBuilder, String> accumulator() {
        return (sb, string) -> {
            sb.append(string);
            sb.append('\n');
        };
    }

    @Override
    public BinaryOperator<StringBuilder> combiner() {
        return (sb1, sb2) -> {
            sb1.append(sb2);
            return sb1;
        };
    }

    @Override
    public Function<StringBuilder, String> finisher() {
        return StringBuilder::toString;
    }

    @Override
    public Set<Characteristics> characteristics() {
        return Collections.emptySet();
    }
}

At this point I'm not sure where things went wrong, since I was able to successfully run this a few times before (maybe a resource leak?). So far I've tried:

  1. Creating the output file if it does not exist
  2. Printing / logging (accumulator seems to work fine, it prints all the lines it's supposed to print so I'd assume that the filtering part is fine. However the code inside combiner never seems to run).
  3. Putting Files.walk inside a try() block although that gave me an illegal state exception. ChatGPT said the resource closes automatically when I call collect so I don't think this is an issue

Solution

  • The initial problem in your code is that you are ignoring the returned value of fis.read(buffer) which tells you number of bytes read in:

    int len = 0;
    while (fis.read(buffer)!= -1) {
        ...
        // then use len:
        String s = new String(buffer, 0, len, StandardCharsets.ISO_8859_1);
    

    This avoids allocating a huge blank ended String and then using your trim() as in your answer. Unfortunately this fix still has a problem as it coulds still split a line part way. Better to read line by line such as with BufferedReader.

    Your program is considerably more complicated than it needs to be as you've ignored many API calls of java.nio.Files. Much of your code in readLargeFile is a long way of reproducing one call to Files.lines, writeToFile is one line call to Files.write( Iterable ). You don't need to collect potentially 20MB of matches into memory structure StringBuilderCollector, because you can convert Stream<String> to Iterable<String> which can be passed to Files.write. For example:

    void readLargeFile(Path path) {
        logger.debug("readLargeFile "+path);
        Path outFilePath = Paths.get(this.outFile);
    
        // This gets a stream of all lines:
        try(Stream<String> lines = Files.lines(path, StandardCharsets.ISO_8859_1) ) {
    
            // Make stream conform to iterable (NB one time use only)
            Iterable<String> iterable = lines.filter(this::containsPattern)::iterator;
    
            // Append all lines of iterable => output file, creating if not exists
            Files.write(outFilePath, iterable, StandardCharsets.ISO_8859_1, StandardOpenOption.APPEND, StandardOpenOption.CREATE);
    
        } catch (IOException e) {
            // Java has a class representing runtime IOException:
            throw new UncheckedIOException("Error while reading file " + path, e);
        }
    }
    

    Your launcher does not need to use File:

    public void process() throws IOException {
        try(Stream<Path> paths = listFiles(this.rootPath)) {
            paths.forEach(this::readLargeFile);
        }
    }
    

    And finally, Files.find could be used as alternative to Files.walk as it reduces filesystem accesses in the filter if you want the file attributes:

    public Stream<Path> listFiles(String rootDir) throws IOException{
        // or for special/hidden files use (p,a) -> !a.isDirectory()
        return Files.find(Paths.get(rootDir), Integer.MAX_VALUE, (p,a) -> a.isRegularFile());  
    }