
Decoding String (from header) encoded by Base64 and RFC2047 in Java


I'm working on a function to decode a string (from a header) that is encoded in both Base64 and RFC2047 in Java.

Given this header:

SGVhZGVyOiBoZWFkZXJ2YWx1ZQ0KQmFkOiBOYW1lOiBiYWRuYW1ldmFsdWUNClVuaWNvZGU6ID0/VVRGLTg/Qj81YmV4NXF5eTU2dUw2SUNNNTZ1TDVMcTY3N3lNNWJleDVxeXk2WUdVNklDTTZZR1U/PSA9P1VURi04P0I/NUxxNjc3eU01YmV4NW9tQTVMaU41cXl5Nzd5TTVZdS81cGE5NXBhODVMcTY0NENDPz0NCg0K

My expected output is:

Header: headervalue Bad: Name: badnamevalue Unicode: 己欲立而立人,己欲達而達人,己所不欲,勿施於人。

The only relevant function that I have found and tried was Base64.decodeBase64(headers), which produces this when printed out:

Header: headervalue Bad: Name: badnamevalue Unicode: =?UTF-8?B?5bex5qyy56uL6ICM56uL5Lq677yM5bex5qyy6YGU6ICM6YGU?= =?UTF-8?B?5Lq677yM5bex5omA5LiN5qyy77yM5Yu/5pa95pa85Lq644CC?=

To solve this, I've been trying MimeUtility.decode(), converting the byte array returned from Base64.decodeBase64(headers) to an InputStream, but the result was identical to the above.

InputStream headerStream = new ByteArrayInputStream(Base64.decodeBase64(headers));
InputStream result = MimeUtility.decode(headerStream, "quoted-printable");

I have been searching the internet but have yet to find a solution. Does anyone know a way to decode MIME headers from the resulting byte array?

Any help is appreciated! This is also my first Stack Overflow post, so apologies if I'm missing anything. Please let me know if there's more information I can provide!


Solution

  • The Base64 you have there really does decode to what you pasted, including the =?UTF-8?B?...?= parts. Those are RFC 2047 encoded-words, the standard way to carry non-ASCII text in MIME headers.

    The stuff that follows is again base64.

    There's base64-encoded data inside your base-64 encoded data. As Xzibit would say: I put some Base64 in your base64 so you can base64 while you base64. Why do I feel old all of a sudden?

    In other words, the base64 input you get is a crazy, extremely inefficient format invented by a crazy person.

    My advice is that you tell them to come up with something less insane.

    Failing that:

    Search the decoded string with a regex for =?UTF-8?B?...?= segments, then apply Base64 decoding again to the part between the B? and the closing ?=.

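    Also, you're using some third party base64 decoder, probably apache. Apache libraries tend to suck. Base64 is baked into java, there is no reason to use worse libraries here. I've fixed that; the Base64 in this snippet is java.util.Base64. Its API is slightly different: you call Base64.getDecoder().decode(...) instead of Base64.decodeBase64(...). The snippet needs imports for java.util.Base64, java.nio.charset.StandardCharsets, java.util.regex.Pattern and java.util.regex.Matcher.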

    String sourceB64 = "SGV..."; // that input base64 you have.
    byte[] sourceBytes = Base64.getDecoder().decode(sourceB64);
    String source = new String(sourceBytes, StandardCharsets.UTF_8);
    Pattern p = Pattern.compile("=\\?UTF-8\\?B\\?(.*?)\\?=");
    Matcher m = p.matcher(source);
    StringBuilder out = new StringBuilder();
    int curPos = 0;
    while (m.find()) {
      out.append(source.substring(curPos, m.start()));
      curPos = m.end();
      String content = new String(Base64.getDecoder().decode(m.group(1)), StandardCharsets.UTF_8);
      out.append(content);
    }
    out.append(source.substring(curPos));
    
    System.out.println(out.toString());
    

    If I run that, I get:

    Header: headervalue
    Bad: Name: badnamevalue
    Unicode: 己欲立而立人,己欲達而達 人,己所不欲,勿施於人。
    

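    Which is what you want, with one difference: the space between the two =?UTF-8?B?...?= segments is copied through, so 達 and 人 end up separated by a space. RFC 2047 says whitespace between adjacent encoded-words is to be ignored when the header is displayed, so a strict decoder would drop it.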
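
    If the space between 達 and 人 in the output above is unwanted, one option is to skip whitespace that sits only between two matches. What follows is a minimal sketch, assuming that only UTF-8 segments with the B (Base64) encoding appear, as in the snippet. The class name EncodedWordGapSketch and the method name decodeEncodedWords are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EncodedWordGapSketch {

    // Same pattern as the snippet: UTF-8 charset, "B" (Base64) encoding only.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?UTF-8\\?B\\?(.*?)\\?=");

    static String decodeEncodedWords(String source) {
        Matcher m = ENCODED_WORD.matcher(source);
        StringBuilder out = new StringBuilder();
        int curPos = 0;
        boolean seenEncodedWord = false;
        while (m.find()) {
            // Text between the previous match (or the start) and this match.
            String gap = source.substring(curPos, m.start());
            // Drop the gap only if it is pure whitespace between two segments;
            // any other text is copied verbatim.
            if (!(seenEncodedWord && gap.trim().isEmpty())) {
                out.append(gap);
            }
            byte[] raw = Base64.getDecoder().decode(m.group(1));
            out.append(new String(raw, StandardCharsets.UTF_8));
            curPos = m.end();
            seenEncodedWord = true;
        }
        // Whatever follows the last segment is kept as-is.
        out.append(source.substring(curPos));
        return out.toString();
    }

    public static void main(String[] args) {
        String header = "Subject: =?UTF-8?B?SGVsbG8=?= =?UTF-8?B?V29ybGQ=?= (plain tail)";
        System.out.println(decodeEncodedWords(header));
    }
}
```

    With the sample header in main, this prints Subject: HelloWorld (plain tail): the space between the two segments is gone, while the space before (plain tail) is kept.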

    Explanation of that code:

    • It first Base64-decodes the input and turns the result into a string. (Your idea of using an InputStream is a red herring; it doesn't help here. You just want to turn bytes into a string, which is what line 3 of that snippet does: pass the byte array and the encoding those bytes are in to new String(...), and that's all.)
    • It then goes on the hunt for =?UTF-8?B?--base64here--?= inside the decoded text: the base64-in-the-base64.
    • It then decodes that base64, turns it into a string in the same fashion, and replaces the segment with it.
    • It just adds everything besides those =?UTF-8?B?...?= segments verbatim.
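
    The steps above can also be written as one self-contained method. This is only a sketch under a few assumptions: it needs Java 9 or later for Matcher.replaceAll with a function argument, it reads the charset label from each segment instead of hard-coding UTF-8, it handles only the B (Base64) encoding, and it leaves whitespace between segments untouched. The names NestedBase64Sketch and decodeHeaderBlock are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NestedBase64Sketch {

    // Group 1 = charset label, group 2 = Base64 payload; "B" or "b" is accepted.
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?\\s]+)\\?[Bb]\\?([^?\\s]*)\\?=");

    static String decodeHeaderBlock(String outerBase64) {
        // Step 1: decode the outer Base64 layer and turn the bytes into a string.
        byte[] outerBytes = Base64.getDecoder().decode(outerBase64);
        String source = new String(outerBytes, StandardCharsets.UTF_8);

        // Steps 2 to 4: find each segment, decode its payload with the charset
        // it names, and splice the text back in. Everything else is untouched.
        Matcher m = ENCODED_WORD.matcher(source);
        return m.replaceAll(match -> {
            // Charset.forName throws if the label is unknown to the JVM.
            Charset charset = Charset.forName(match.group(1));
            byte[] raw = Base64.getDecoder().decode(match.group(2));
            // quoteReplacement keeps '$' and '\' in the decoded text literal.
            return Matcher.quoteReplacement(new String(raw, charset));
        });
    }

    public static void main(String[] args) {
        // Build a small outer layer here instead of pasting a long literal.
        String inner = "Subject: =?UTF-8?B?SGVsbG8=?=\r\n";
        String outer = Base64.getEncoder()
                .encodeToString(inner.getBytes(StandardCharsets.UTF_8));
        System.out.print(decodeHeaderBlock(outer));
    }
}
```

    Both Base64.getDecoder().decode and Charset.forName throw unchecked exceptions on malformed input, so real header data may call for a try/catch that falls back to the undecoded segment.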