java.io.BufferedWriter throttles or stops completely while writing, does anyone know why?


As described in the title, I have run into weird throttling of the write speed (sometimes writing pauses entirely) when using java.io.BufferedWriter.

My program reads from a bunch of .csv files and repacks them into another bunch of .csv files grouped by the first column.

For example, if I have a line in the .csv file Tom, 86, 87, 88, this line will be written to the .csv file named Tom.csv.

I use a HashMap<String, BufferedWriter> to cache the writers so that the program only has to open/close each writer once.

(I purposely split the file listing and the processing logic, for easier debugging.)

The code:

package dev.repackcsv;

import java.io.*;
import java.nio.file.*;
import java.util.*;

public class Main {

    private static final Path USER_DIR;
    private static final Path PROP_FILE;
    private static final Properties PROPERTIES;

    private static final Path SCAN_DIR;
    private static final Path OUTPUT_DIR;

    private static final List<Path> SCANNED_FILE_LIST;
    private static final Map<String, BufferedWriter> OUTPUT_FILE_MAP;

    private static void loadProperties() {
        try (InputStream propFileInputStream = Files.newInputStream(PROP_FILE)) {
            PROPERTIES.load(propFileInputStream);
        } catch (IOException e) {
            System.err.println("[Error] Failed to load properties from \"application.properties\"");
            System.exit(1);
        }
    }

    private static String getProperty(String propertyName) {
        String property = PROPERTIES.getProperty(propertyName);
        if (property == null) {
            System.err.println("[Error] Undefined property: " + propertyName);
            System.exit(1);
        }
        return property;
    }

    static {
        USER_DIR = Paths.get(System.getProperty("user.dir"));
        PROP_FILE = USER_DIR.resolve("application.properties");
        if (!Files.exists(PROP_FILE)) {
            System.err.println("[Error] \"application.properties\" file does not exist.");
            System.exit(1);
        }

        PROPERTIES = new Properties();
        loadProperties();
        SCAN_DIR = Paths.get(getProperty("scan.dir")).toAbsolutePath();
        if (!Files.exists(SCAN_DIR)) {
            System.err.println("[Error] Scan directory does not exist");
            System.exit(1);
        }
        OUTPUT_DIR = Paths.get(getProperty("output.dir")).toAbsolutePath();
        if (!Files.exists(OUTPUT_DIR)) {
            System.err.println("[Error] Output directory does not exist");
            System.exit(1);
        }

        SCANNED_FILE_LIST = new LinkedList<>();
        OUTPUT_FILE_MAP = new HashMap<>();
    }

    private static void loadScannedFileList()
            throws IOException {
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(SCAN_DIR)) {
            for (Path path : ds) {
                SCANNED_FILE_LIST.add(path.toAbsolutePath());
            }
        }
    }

    private static BufferedWriter getOutputFileBufferedWriter(String key, String headLine) throws IOException {
        if (OUTPUT_FILE_MAP.containsKey(key)) {
            return OUTPUT_FILE_MAP.get(key);
        } else {
            Path outputFile = OUTPUT_DIR.resolve(key + ".csv");
            boolean isNewFile = false;
            if (!Files.exists(outputFile)) {
                Files.createFile(outputFile);
                isNewFile = true;
            }
            BufferedWriter bw = Files.newBufferedWriter(outputFile);
            if (isNewFile) {
                bw.write(headLine);
                bw.newLine();
                bw.flush();
            }
            OUTPUT_FILE_MAP.put(key, bw);
            return bw;
        }
    }

    private static void processScannedCSV(Path csvFile)
            throws IOException {
        System.out.printf("[Info] Current file \"%s\"%n", csvFile);
        long fileSize = Files.size(csvFile);
        try (BufferedReader br = new BufferedReader(new InputStreamReader(Files.newInputStream(csvFile)))) {
            String headLine = br.readLine();
            if (headLine == null) { return; }
            String dataLine;
            long readByteSize = 0;
            while ((dataLine = br.readLine()) != null) {
                int firstCommaIndex = dataLine.indexOf(',');
                if (firstCommaIndex == -1) { continue; }
                BufferedWriter bw = getOutputFileBufferedWriter(dataLine.substring(0, firstCommaIndex), headLine);
                bw.write(dataLine);
                bw.newLine();
                readByteSize += dataLine.getBytes().length;
                System.out.print("\r[Progress] " + readByteSize + '/' + fileSize);
            }
        }
        System.out.print("\r");
    }

    private static void processScannedFiles()
            throws IOException {
        for (Path file : SCANNED_FILE_LIST) {
            if (!Files.exists(file)) {
                System.out.printf("[WARN] Scanned file \"%s\" does not exist, skipping...%n", file);
                continue;
            }
            if (!file.toString().endsWith(".csv")) { continue; }
            processScannedCSV(file);
        }
    }

    public static void main(String[] args)
            throws IOException {
        loadScannedFileList();
        processScannedFiles();
        for (BufferedWriter bw : OUTPUT_FILE_MAP.values()) {
            bw.flush();
            bw.close();
        }
    }

}

The output (in this run the program froze on the line bw.write(dataLine);):

  • I use IntelliJ IDEA as the editor and run the program in debug mode.
Connected to the target VM, address: '127.0.0.1:25111', transport: 'socket'
[Info] Current file "..\scan-dir\L2_options_20150102.csv"
[Progress] 8166463/109787564

It would be great if anyone knows the cause / has a solution for this :( Thanks!


Solution

  • Having many files open at once is a heavy load on the operating system: file handles are a limited resource, and the disk has to "move the write head around" should one do things concurrently.

    About the statics

    The code shows experience (also with Java), maybe also from a language like C, because this use of static is unusual. You could, in main, do a new Main().executeMyThings(); and drop static elsewhere.

    Measures to take

    • Do not use text but binary data: not Writer/Reader, but OutputStream/InputStream. This avoids a Unicode conversion back and forth. It also avoids the risk of data loss when, say, a Windows UTF-8 file is processed on Linux with a different default charset.
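
      A minimal sketch of the byte-for-byte approach (class and method names are made up). Copying through streams instead of Reader/Writer leaves every byte untouched, so no charset can corrupt the data:

      ```java
      import java.io.BufferedInputStream;
      import java.io.BufferedOutputStream;
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;

      public class BinaryCopySketch {

          // Copy a file byte-for-byte: no charset decoding/encoding happens,
          // so the output is identical to the input regardless of its encoding.
          static void copyBinary(Path in, Path out) throws IOException {
              try (BufferedInputStream bis = new BufferedInputStream(Files.newInputStream(in));
                   BufferedOutputStream bos = new BufferedOutputStream(Files.newOutputStream(out))) {
                  bis.transferTo(bos); // bulk transfer, available since Java 9
              }
          }

          public static void main(String[] args) throws IOException {
              // Temp files stand in for the real scan/output paths.
              Path in = Files.createTempFile("scan", ".csv");
              Path out = Files.createTempFile("repack", ".csv");
              Files.write(in, "Tom,86,87,88\n".getBytes());
              copyBinary(in, out);
              System.out.println(Files.mismatch(in, out) == -1); // true: byte-identical
          }
      }
      ```

      For the grouping logic you would still have to scan the bytes for ',' and '\n' yourself, but for plain ASCII CSV structure that is safe to do on raw bytes.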

    • Use ArrayList rather than LinkedList; for many items it is likely to behave better.

    • You might want to collect file paths instead of BufferedWriters. Every BufferedWriter is not only an OS resource but also maintains an in-memory buffer. It might even be more performant to write the head line, close the file, and reopen it in append mode for later lines. The head line could be written with Files.writeString.
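
      A sketch of that path-based approach (the class, method, and set names are made up): remember only which keys already got a head line, and reopen the file in append mode instead of holding a writer open per key:

      ```java
      import java.io.BufferedWriter;
      import java.io.IOException;
      import java.nio.file.Files;
      import java.nio.file.Path;
      import java.nio.file.StandardOpenOption;
      import java.util.HashSet;
      import java.util.Set;

      public class AppendSketch {

          // Keys whose output file already has its head line (hypothetical helper state).
          private static final Set<String> SEEN_KEYS = new HashSet<>();

          static void appendLine(Path outputDir, String key, String headLine, String dataLine)
                  throws IOException {
              Path outputFile = outputDir.resolve(key + ".csv");
              if (SEEN_KEYS.add(key) && !Files.exists(outputFile)) {
                  // Write the head line once; Files.writeString creates the file (Java 11+).
                  Files.writeString(outputFile, headLine + System.lineSeparator());
              }
              // No long-lived handle: open, append, close.
              try (BufferedWriter bw = Files.newBufferedWriter(outputFile,
                      StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                  bw.write(dataLine);
                  bw.newLine();
              }
          }

          public static void main(String[] args) throws IOException {
              Path dir = Files.createTempDirectory("repack");
              appendLine(dir, "Tom", "name,a,b,c", "Tom,86,87,88");
              appendLine(dir, "Tom", "name,a,b,c", "Tom,90,91,92");
              System.out.println(Files.readAllLines(dir.resolve("Tom.csv")).size()); // 3
          }
      }
      ```

      Opening and closing per line has its own cost, so in practice you would likely batch several lines per key before appending; the point is that no file handle outlives a single write.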

    • System.out is costly. A ConsoleLogger would be safer with concurrency, but has a cost too.

    • readByteSize misses the line break (1 byte for "\n", 2 for Windows "\r\n" files), so the progress counter never reaches fileSize.

    • A BufferedWriter can be created with a larger/smaller buffer size.
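
      For reference, the buffer size is the second constructor argument of BufferedWriter; the 64 Ki value below is just an illustration, not a recommendation:

      ```java
      import java.io.BufferedWriter;
      import java.io.IOException;
      import java.io.OutputStreamWriter;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Path;

      public class BufferSizeSketch {
          public static void main(String[] args) throws IOException {
              Path out = Files.createTempFile("big-buffer", ".csv");
              // BufferedWriter(Writer, int) sets the buffer size in chars;
              // the default is 8192. A larger buffer means fewer, bigger writes.
              try (BufferedWriter bw = new BufferedWriter(
                      new OutputStreamWriter(Files.newOutputStream(out), StandardCharsets.UTF_8),
                      64 * 1024)) { // 64 Ki chars, an arbitrary tuning value
                  bw.write("Tom,86,87,88");
                  bw.newLine();
              }
              System.out.println(Files.readAllLines(out).get(0)); // Tom,86,87,88
          }
      }
      ```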

            for (Path path : ds) {
                SCANNED_FILE_LIST.add(path.toAbsolutePath());
            }
      

    might be better as:

            ds.forEach(path -> SCANNED_FILE_LIST.add(path.toAbsolutePath()));