I have this piece of code that is intended to split a string into an array of strings, using CHUNK_SIZE as the size of each split in bytes (I'm doing this to paginate results). It works in most cases, when characters are 1 byte, but when a multi-byte character (for example a 2-byte French character like é, or a Chinese character, which takes 3 or 4 bytes) falls exactly at the split location, I end up with unreadable characters at the end of my first array element and at the start of the second one.
Is there a way to fix the code to account for multi-byte characters so they are kept intact in the final result?
public static ArrayList<String> splitFile(String data) throws Exception {
    ArrayList<String> messages = new ArrayList<>();
    int CHUNK_SIZE = 400000; // 0.75mb
    if (data.getBytes().length > CHUNK_SIZE) {
        byte[] buffer = new byte[CHUNK_SIZE];
        int start = 0, end = buffer.length;
        long remaining = data.getBytes().length;
        ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes());
        while ((inputStream.read(buffer, start, end)) != -1) {
            ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
            outputStream.write(buffer, start, end);
            messages.add(outputStream.toString("UTF-8"));
            remaining = remaining - end;
            if (remaining <= end) {
                end = (int) remaining;
            }
        }
        return messages;
    }
    messages.add(data);
    return messages;
}
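For example, here is a minimal way to reproduce the symptom by cutting the UTF-8 bytes in the middle of a character (the string, offset and class name are just for illustration):

import java.nio.charset.StandardCharsets;

public class SplitSymptom {
    public static void main(String[] args) {
        byte[] bytes = "café".getBytes(StandardCharsets.UTF_8); // 5 bytes: 'é' is 0xC3 0xA9
        // Cutting after byte 4 lands between the two bytes of 'é':
        String first = new String(bytes, 0, 4, StandardCharsets.UTF_8);
        String second = new String(bytes, 4, bytes.length - 4, StandardCharsets.UTF_8);
        System.out.println(first);  // "caf�" - the trailing lead byte decodes to U+FFFD
        System.out.println(second); // "�"    - the lone continuation byte decodes to U+FFFD
    }
}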
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public static List<String> splitFile(String data) throws IOException {
    List<String> messages = new ArrayList<>();
    final int CHUNK_SIZE = 400_000; // 0.75mb
    byte[] dataBytes = data.getBytes(StandardCharsets.UTF_8);
    byte[] buffer = new byte[CHUNK_SIZE];
    int start = 0;
    final int end = CHUNK_SIZE;
    ByteArrayInputStream inputStream = new ByteArrayInputStream(dataBytes);
    for (;;) {
        int read = inputStream.read(buffer, start, end - start);
        if (read == -1) {
            if (start != 0) { // Flush the bytes carried over from the last chunk.
                messages.add(new String(buffer, 0, start, StandardCharsets.UTF_8));
            }
            break;
        }
        // Check for a half-read multi-byte sequence at the end of the chunk:
        int fullEnd = start + read;
        while (fullEnd > 0) {
            byte b = buffer[fullEnd - 1];
            if (b >= 0) { // ASCII: a complete character, keep it.
                break;
            }
            if ((b & 0xC0) == 0xC0) { // Lead byte of a sequence: leave the whole sequence out.
                --fullEnd;
                break;
            }
            --fullEnd; // Continuation byte: keep scanning backwards.
        }
        messages.add(new String(buffer, 0, fullEnd, StandardCharsets.UTF_8));
        start += read - fullEnd; // start is now the number of leftover bytes after fullEnd.
        if (start > 0) { // Copy the bytes after fullEnd to the start of the buffer.
            System.arraycopy(buffer, fullEnd, buffer, 0, start);
            //               src     srcPos   dest    destPos  length
        }
    }
    return messages;
}
I have kept the ByteArrayInputStream, as most often one reads from an InputStream instead of already having all the bytes in memory.
The chunk buffer is then filled from start rather than from 0, as some bytes from the prior chunk read may still linger there.
The read gives the number of bytes read, or -1 at the end of the stream.
An ASCII char at the end is fine to keep; otherwise I move the end back to just before the lead byte of the multi-byte sequence. That sequence may or may not have been read completely; either way I simply keep it for the next chunk being read.
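To illustrate the byte classes that backward scan relies on (the sample characters are arbitrary):

// UTF-8 byte classes, as used in the backward scan:
//   0xxxxxxx  ASCII, a complete character on its own   -> b >= 0
//   11xxxxxx  lead byte of a 2-4 byte sequence         -> (b & 0xC0) == 0xC0
//   10xxxxxx  continuation byte inside a sequence      -> (b & 0xC0) == 0x80
for (byte b : "é中".getBytes(StandardCharsets.UTF_8)) {
    String kind = b >= 0 ? "ASCII" : (b & 0xC0) == 0xC0 ? "lead byte" : "continuation byte";
    System.out.printf("%02X %s%n", b & 0xFF, kind);
}
// Prints: C3 lead byte, A9 continuation byte, E4 lead byte, B8 continuation byte, AD continuation byte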
This splitFile code did not see a compiler.
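A quick, equally untested way to sanity-check it, assuming it sits next to the splitFile method above, would be to rejoin the chunks and compare them with the input (the test data below is arbitrary):

public static void main(String[] args) throws IOException {
    // Build roughly 1.1 MB of UTF-8 with multi-byte characters sprinkled throughout.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 100_000; i++) {
        sb.append("abcé中文"); // 11 UTF-8 bytes per iteration
    }
    String data = sb.toString();
    List<String> chunks = splitFile(data);
    String rejoined = String.join("", chunks);
    System.out.println(chunks.size() > 1);           // expect true: it actually split
    System.out.println(data.equals(rejoined));       // expect true: nothing lost or corrupted
    System.out.println(rejoined.contains("\uFFFD")); // expect false: no replacement characters
}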
A List of messages is not memory friendly either.
BTW, on a char[] one would have a similar problem: sometimes a Unicode code point (symbol) is two (UTF-16) chars, a surrogate pair.
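For illustration (the emoji and index are arbitrary), cutting between the two halves of such a surrogate pair corrupts the text in the same way:

String s = "ab\uD83D\uDE00cd";     // "ab😀cd": the emoji is one code point but two chars
String first = s.substring(0, 3);  // "ab" plus an unpaired high surrogate
String second = s.substring(3);    // an unpaired low surrogate plus "cd"
// Encoding either half on its own replaces the lone surrogate with '?':
System.out.println(new String(first.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8));  // ab?
System.out.println(new String(second.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8)); // ?cd
// Character.isHighSurrogate can detect such a boundary before splitting:
System.out.println(Character.isHighSurrogate(s.charAt(2))); // true: don't split between chars 2 and 3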