java memory utf-8 byte 8-bit

How to avoid memory wastage when storing UTF-8 characters (8 bit) in Java characters (16 bit)? Two in one?


I'm afraid I have a question on a detail of a rather oversaturated topic. I searched around a lot, but couldn't find a clear answer to this specific, obvious and (imho) important problem:

When converting byte[] to String using UTF-8, each byte (8 bit) becomes an 8-bit character encoded by UTF-8, but each UTF-8 character is saved as a 16-bit character in Java. Is that correct? If yes, this means that each Java character only uses the first 8 bits and consumes double the memory? Is that correct too? I wonder how this wasteful behaviour is acceptable..

Isn't there some trick to have a pseudo String that is 8 bit? Would that actually result in less memory consumption? Or maybe, is there a way to store >two< 8-bit characters in one Java 16-bit character to avoid this memory waste?

thanks for any deconfusing answers...

EDIT: Hi, thanks everybody for answering. I was aware of the variable-length property of UTF-8. However, since my source is byte, which is 8 bit, I understood (apparently wrongly) that it needs only 8-bit UTF-8 words. Is UTF-8 conversion actually saving the strange symbols that you see when you do "cat somebinary" on the CLI? I thought UTF-8 was just somehow used to map each of the possible 8-bit words of byte to one particular 8-bit word of UTF-8. Wrong? I thought about using Base64, but it's bad because it uses only 7 bit..

Question reformulated: is there a smarter way to convert byte[] to something like a String? My favorite was to just cast byte[] to char[], but then I still have 16-bit words.

Additional use case info:

I'm adapting Jedis (a Java client for the NoSQL store Redis) as the "primitive storage layer" for HyperGraphDB. So Jedis is a database for another "database". My problem is that I have to feed Jedis with byte[] data all the time, but internally >Redis< (the actual server) deals only with "binary safe" strings. Since Redis is written in C, a char there is 8 bits long (AFAIK not ASCII, which is 7 bit). In Jedis, however, in the Java world, every character is 16 bits long internally. I don't understand this code (yet), but I suppose Jedis then converts these Java 16-bit strings to Redis-conforming 8-bit strings ([here][3]). It says it extends FilterOutputStream. My hope is to bypass the byte[] <-> String conversion altogether and use that FilterOutputStream...?

Now I wonder: if I had to interconvert byte[] and String all the time, with data sizes ranging from very small to potentially very big, isn't there a huge waste of memory in having each 8-bit character passed around as 16 bit within Java?


Solution

  • Isn't there some trick to have a pseudo String that is 8 bit?

    Yes, make sure you have an up-to-date version of Java. ;)

    http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html

    -XX:+UseCompressedStrings Use a byte[] for Strings which can be represented as pure ASCII. (Introduced in Java 6 Update 21 Performance Release)

    EDIT: This option doesn't work in Java 6 update 22 and is not on by default in Java 6 update 24. Note: it appears this option may slow performance by about 10%.

    The following program

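    import java.util.ArrayList;
    import java.util.List;

    public class BytesPerChar {
        public static void main(String... args) {
            // Build one long, pure-ASCII string of digits.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10000; i++)
                sb.append(i);

            // The first two runs are warm-up and print nothing.
            for (int j = 0; j < 10; j++)
                test(sb, j >= 2);
        }

        private static void test(StringBuilder sb, boolean print) {
            // Holding the strings in a list keeps them alive across the second GC.
            List<String> strings = new ArrayList<String>();
            forceGC();
            long free = Runtime.getRuntime().freeMemory();

            long size = 0; // total number of characters allocated
            for (int i = 0; i < 100; i++) {
                final String s = "" + sb + i;
                strings.add(s);
                size += s.length();
            }
            forceGC();
            // Drop in free memory, i.e. the heap now held by the new strings.
            long used = free - Runtime.getRuntime().freeMemory();
            if (print)
                System.out.println("Bytes per character is " + (double) used / size);
        }

        // Request a collection twice, pausing so the collector can finish.
        private static void forceGC() {
            try {
                System.gc();
                Thread.sleep(250);
                System.gc();
                Thread.sleep(250);
            } catch (InterruptedException e) {
                throw new AssertionError(e);
            }
        }
    }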
    

    Prints this by default

    Bytes per character is 2.0013668655941212
    Bytes per character is 2.0013668655941212
    Bytes per character is 2.0013606946433575
    Bytes per character is 2.0013668655941212
    

    and this with the option -XX:+UseCompressedStrings

    Bytes per character is 1.0014671435440285
    Bytes per character is 1.0014671435440285
    Bytes per character is 1.0014609725932648
    Bytes per character is 1.0014671435440285
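
On the reformulated question (a smarter way to turn byte[] into a String), here is a minimal sketch rather than a definitive recipe. It assumes Java 7 or later for StandardCharsets; the class name ByteRoundTrip and the helper roundTrip are made up for illustration. It shows that decoding arbitrary binary data as UTF-8 is not a one-to-one mapping, because invalid byte sequences are replaced during decoding, whereas ISO-8859-1 maps each of the 256 byte values to exactly one char and back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteRoundTrip {
    // Hypothetical helper: decode bytes to a String, then encode it back
    // with the same charset.
    static byte[] roundTrip(byte[] data, Charset cs) {
        return new String(data, cs).getBytes(cs);
    }

    public static void main(String[] args) {
        // All 256 possible byte values, as arbitrary binary data may contain.
        byte[] data = new byte[256];
        for (int i = 0; i < data.length; i++)
            data[i] = (byte) i;

        boolean latin1Lossless =
                Arrays.equals(data, roundTrip(data, StandardCharsets.ISO_8859_1));
        boolean utf8Lossless =
                Arrays.equals(data, roundTrip(data, StandardCharsets.UTF_8));

        System.out.println("ISO-8859-1 round trip lossless: " + latin1Lossless); // true
        System.out.println("UTF-8 round trip lossless: " + utf8Lossless);        // false
    }
}
```

So if binary data really has to pass through a String, ISO-8859-1 keeps it intact, but each char still occupies 16 bits unless something like the compressed-strings option above applies. Keeping the data as byte[] end to end avoids the conversion entirely.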