Tags: java, io, nio

Is there an in-memory scanner, similar to java.util.Scanner, for reading a string line by line?


I have been trying to build an in-memory string-processing application for an assignment. My plan was to load the whole string into memory and then parse it there.

To do this, I first wrote a byte-level parser that behaves like Scanner but works on a CharBuffer (with the whole string loaded into memory). However, it was no faster than a disk-based parser.

Then I noticed that CharBuffer implements Readable, so I tried to use Scanner like this:

FileChannel channel = new FileInputStream(file).getChannel();
// Map the whole file into memory
MappedByteBuffer mapped_buffer =
             channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
Charset charset = Charset.forName("US-ASCII");
CharsetDecoder decoder = charset.newDecoder();
// Decode the mapped bytes into a single in-memory CharBuffer
CharBuffer buffer = decoder.decode(mapped_buffer);
Scanner sc = new Scanner(buffer).useDelimiter("\n");

But it is no faster than the plain disk-based Scanner, and sometimes even slower. The disk-based version looks like this:

File target = new File(target_path);

Scanner scan = new Scanner(target);
while (scan.hasNextLine()) {
    line = scan.nextLine();
    // ...
}

Everyone says that in-memory processing is much faster than disk-based processing. What should I consider in order to actually get that performance when parsing a string in memory? Is Scanner even a reasonable way to read in-memory string data, or is my Scanner not actually reading the parsed lines from memory?


Solution

  • Why use Scanner at all? Scanner, CharsetDecoder, etc., are all going to be slow.

    Especially if all you are reading is ASCII, you don't really need any of that.

    byte[] bytes = new byte[(int)file.length()];
    
    // A single read() call is not guaranteed to fill the array,
    // so use readFully() to loop until every byte is in.
    DataInputStream in = new DataInputStream(new FileInputStream(file));
    in.readFully(bytes);
    in.close();
    
    // ASCII bytes map 1:1 to chars
    char[] text = new char[bytes.length];
    for (int i = 0; i < bytes.length; i++) {
        text[i] = (char)(bytes[i] & 0xFF);
    }
    
    for (String line : new String(text).split("\n")) {
        //
    }
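For what it's worth, on Java 7 and later the same ASCII trick can be written more compactly with java.nio.file.Files. This is just a sketch (the sample file and its contents are made up for illustration); it relies on ISO-8859-1 mapping each byte 1:1 to a char, which for pure-ASCII input is equivalent to the manual loop above:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadAscii {
    public static void main(String[] args) throws IOException {
        // Write a small sample file so the example is self-contained.
        Path file = Files.createTempFile("sample", ".txt");
        Files.write(file, "first\nsecond\nthird\n".getBytes(StandardCharsets.US_ASCII));

        // Read every byte at once; ISO-8859-1 decodes each byte to the
        // same char value, so no real charset decoding work happens.
        byte[] bytes = Files.readAllBytes(file);
        String text = new String(bytes, StandardCharsets.ISO_8859_1);

        for (String line : text.split("\n")) {
            System.out.println(line);
        }
    }
}
```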
    

    UTF-16 is only an extra step more complicated.
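To illustrate that extra step, here is a sketch of the UTF-16 case, assuming the file starts with a byte-order mark (BOM) so the decoder can pick the byte order; the sample file is invented for the example:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadUtf16 {
    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf16", ".txt");
        // The UTF-16 charset writes a BOM by default when encoding.
        Files.write(file, "héllo\nwörld\n".getBytes(StandardCharsets.UTF_16));

        byte[] bytes = Files.readAllBytes(file);
        // Decoding with UTF_16 consumes the BOM and picks the right
        // byte order; use UTF_16BE / UTF_16LE for BOM-less files.
        String text = new String(bytes, StandardCharsets.UTF_16);

        for (String line : text.split("\n")) {
            System.out.println(line);
        }
    }
}
```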

    If you want to read line by line, that's not all that complicated either. I would still recommend against something like Scanner.

    StringBuilder line = new StringBuilder(1024);
    // Wrap in a BufferedInputStream: calling read() on a raw
    // FileInputStream goes to the OS for every single byte.
    InputStream in = new BufferedInputStream(new FileInputStream(file));
    
    int next;
    
    boolean lb = true;
    
    while ((next = in.read()) != -1) {
    
        if (next == 0xD || next == 0xA) {
    
            // skip if there are multiple line breaks
            if (lb) continue;
    
            lb = true;
            sendNextLineSomewhere(line.toString());
    
            // avoid new object creations
            line.delete(0, line.length());
    
        } else {
    
            lb = false;
            line.append((char)next);
        }
    }
    
    // don't lose the last line if the file doesn't end with a line break
    if (line.length() > 0) {
        sendNextLineSomewhere(line.toString());
    }
    
    in.close();
    

    One side note about ASCII line breaks is that there are two characters related to it. Line Feed (0xA) and Carriage Return (0xD). Some text editors (Windows Notepad for example) register a line break from a two character CR+LF combination. It's just a thing to keep in mind. If you don't take it in to account and your file originates from a program like that you'll get blank lines. And on the output side, if you don't write the CR+LF combo when you want a new line programs that want it won't read the file correctly.