I am creating an IMAP client for Google's Gimap service (IMAP4rev1 with some extensions).
For a small number of emails, the FETCH command returns the UTF-8 replacement character, while the same character is displayed correctly in the Gmail web UI.
For orientation, here is what the protocol exchange looks like with openssl s_client:
openssl s_client -crlf -quiet -connect imap.gmail.com:993
* OK Gimap ready for requests from 80.229.146.237 q15mb3103507wmo
1 LOGIN [email protected] app-password
* CAPABILITY IMAP4rev1 UNSELECT IDLE NAMESPACE QUOTA ID XLIST CHILDREN X-GM-EXT-1 UIDPLUS COMPRESS=DEFLATE ENABLE MOVE CONDSTORE ESEARCH UTF8=ACCEPT LIST-EXTENDED LIST-STATUS LITERAL- SPECIAL-USE APPENDLIMIT=35651584
1 OK [email protected] authenticated (Success)
2 EXAMINE "[Gmail]/All Mail"
* FLAGS (\Answered \Flagged \Draft \Deleted \Seen $Forwarded $Junk $MailFlagBit0 $NotJunk $NotPhishing $Phishing Forwarded Junk NotJunk)
* OK [PERMANENTFLAGS ()] Flags permitted.
* OK [UIDVALIDITY 1] UIDs valid.
* 95696 EXISTS
* 0 RECENT
* OK [UIDNEXT 1110737] Predicted next UID.
* OK [HIGHESTMODSEQ 11952911]
2 OK [READ-ONLY] [Gmail]/All Mail selected. (Success)
3 UID FETCH 936238 (BODY[HEADER.FIELDS (Subject)])
* 44300 FETCH (UID 936238 BODY[HEADER.FIELDS (Subject)] {64}
Subject: Luc�a Cxxxxxxxxxxxxxxx posted a discussion on ASW
)
3 OK Success
4 LOGOUT
* BYE LOGOUT Requested
LOGOUT OK 73 good day (Success)
read:errno=0
The terminal shows the UTF-8 replacement character, but that alone proves nothing, since the terminal itself could be substituting it. The real finding is that when I read the raw bytes in Java code, Gimap sends me the three bytes that encode the Unicode replacement character (U+FFFD) in UTF-8. I have underlined them below:
[ 53 , 75 , 62 , 6A , 65 , 63 , 74 , 3A , 20 , 4C , 75 , 63 , EF , BF , BD , 61 , 20 ,
                                                              ^^   ^^   ^^
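To confirm the finding programmatically rather than by eye, the literal can be scanned for the byte sequence EF BF BD. This is only a minimal sketch; the class and method names are hypothetical and are not part of the test case further down.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: locates the UTF-8 encoding of U+FFFD in a byte array. */
final class ReplacementCharScan {

    /** UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER. */
    private static final byte[] REPLACEMENT = { (byte) 0xEF, (byte) 0xBF, (byte) 0xBD };

    /** Returns the offset of the first EF BF BD sequence in data, or -1 if there is none. */
    static int indexOfReplacement(byte[] data) {
        for (int i = 0; i + REPLACEMENT.length <= data.length; i++) {
            if (data[i] == REPLACEMENT[0]
                    && data[i + 1] == REPLACEMENT[1]
                    && data[i + 2] == REPLACEMENT[2]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // The first bytes of the literal shown above: "Subject: Luc", EF BF BD, "a ".
        byte[] literal = {
            0x53, 0x75, 0x62, 0x6A, 0x65, 0x63, 0x74, 0x3A, 0x20, 0x4C, 0x75, 0x63,
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD, 0x61, 0x20
        };
        System.out.println("replacement sequence at offset " + indexOfReplacement(literal));

        // Sanity check: encode U+FFFD and print the resulting bytes.
        byte[] encoded = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        StringBuilder hex = new StringBuilder();
        for (byte b : encoded) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println("U+FFFD in UTF-8: " + hex.toString().trim());
    }
}
```

A non-negative offset means the replacement character arrived in the server's bytes and was not introduced by the terminal or by a later decoding step on the client.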
Meanwhile, the email displays correctly in Gmail. Here is Google's "view original"...
By RFC 2822, all these header field names and values should be 7-bit ASCII. That is obviously not the case with this email, but how does that explain that Google "knows" what the correct character is, yet won't provide that same character to me through their API?
I wondered whether I am perhaps not using IMAP4rev1 correctly. I notice Gimap advertises the ENABLE and UTF8=ACCEPT capabilities, but adding the ENABLE UTF8=ACCEPT command to the sequence made no difference to the behaviour.
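For reference, RFC 5161 only allows ENABLE in the authenticated state, before any mailbox is selected, so the command has to sit between LOGIN and EXAMINE. The sketch below just builds that command order as strings; the class name, tags, and placeholder credentials are assumptions for illustration, and nothing here establishes that enabling the extension changes what Gimap returns for this message.

```java
import java.util.List;

/** Hypothetical helper: builds the tagged commands in the order ENABLE requires. */
final class EnableUtf8Sequence {

    /** Returns the commands in send order; user and password are placeholders. */
    static List<String> commands(String user, String password, String mailbox, long uid) {
        return List.of(
            "1 LOGIN " + user + " " + password,
            // ENABLE must come after authentication and before EXAMINE/SELECT.
            "2 ENABLE UTF8=ACCEPT",
            "3 EXAMINE \"" + mailbox + "\"",
            "4 UID FETCH " + uid + " (BODY[HEADER.FIELDS (Subject)])");
    }

    public static void main(String[] args) {
        commands("user@example.com", "app-password", "[Gmail]/All Mail", 936238L)
            .forEach(System.out::println);
    }
}
```

If ENABLE is sent after the mailbox is already open, a compliant server is expected to reject it, which would look the same as "no difference" from the client's side.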
Summary:
Observation: The original email sent to Google was invalid, and Google figured out how to store the proper character that the sender intended.
Expected result: Google would give me the bytes they display when I request this field.
Actual result: Google does not give me any bytes to describe the í. Instead, Google's API returns the bytes for the replacement character.
Discussion: Maybe it is a Google implementation error, but I am posting here to check whether I am handling the protocol interaction incorrectly.
Below is my minimal test case, Java code that I can use to access and print out the bytes.
package stackoverflow;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import java.io.InputStream;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.ssl.SSLSocketFactory;
public class StackOverflowTestCase {
/**
* Connect to Gimap server, download a single email's Subject header.
*/
public static void main( String[] argzes ) throws IOException {
SimpleIMAPClient g = new SimpleIMAPClient();
try {
g.changeToLoggedInState().changeToSelectedState();
String input = "";
Pattern expectedResponse = Pattern.compile( "\\* [1234567890]+ FETCH \\(UID 936238 .*" );
g.sendCommand( "4 UID FETCH 936238 (BODY[HEADER.FIELDS (Subject)])" );
while (!input.startsWith( "4 OK " )) {
input = g.readLine();
if ( expectedResponse.matcher( input ).matches() ) {
// find the {999} construct in the response and parse the number.
int bytesToRead = Integer.parseInt( input.split( "\\{|}" )[1] );
byte[] bytes = g.response.readNBytes( bytesToRead );
log( bytes );
}
}
}
finally {
g.logout();
}
}
/**
* Pretty-print a byte array.
*/
private static void log( byte[] bytes ) {
StringBuffer out = new StringBuffer();
out.append( '[' );
String prefix = " ";
for (byte b : bytes) {
out.append( prefix + String.format( "%02X ", b ) );
prefix = ", ";
}
out.append( ']' );
System.out.println( out.toString() );
}
/**
* Class that implements a subset of the IMAP4rev1 protocol (RFC 3501).
* Only enough function is available to reproduce the test case for downloading one email header.
* */
private static class SimpleIMAPClient {
/**
* When true, the line exchanges between IMAP client and IMAP server
* are printed to stdout and stderr respectively.
* */
static final boolean debug = true;
/**
* IMAP connection state, per RFC3501#3
* */
public static class State {
public static final State NOT_AUTHENTICATED = new State();
public static final State AUTHENTICATED = new State();
public static final State SELECTED = new State();
public static final State LOGGED_OUT = new State();
private State () {
}
}
/**
* Outbound stream to the server
* */
public final PrintWriter request;
/**
* Inbound stream from the server
*/
public final InputStream response;
private final Socket imap;
private State state;
/**
* Constructs a connection to the Gmail Gimap server using the default ciphers available in the JDK.
* If established successfully, the object enters NOT_AUTHENTICATED state.
* @throws ExceptionInInitializerError if the connection cannot be established.
* */
public SimpleIMAPClient () {
try {
imap = SSLSocketFactory.getDefault().createSocket();
imap.connect( new InetSocketAddress( "imap.gmail.com", 993 ) );
System.out.println( "Connection established to " + imap.getRemoteSocketAddress() + " from local port "
+ imap.getLocalPort() );
request = new PrintWriter( imap.getOutputStream(), true );
response = imap.getInputStream();
}
catch ( IOException e ) {
throw new ExceptionInInitializerError( e );
}
this.state = State.NOT_AUTHENTICATED;
}
/**
* Moves the state to LOGGED_OUT. The class can no longer be used after calling this method.
* */
public void logout() throws IOException {
sendCommand( "LOGOUT LOGOUT" );
String input = readLine();
while (!input.startsWith( "LOGOUT OK" )) {
input = readLine();
}
response.close();
request.close();
imap.close();
System.out.println( "Disconnected." );
this.state = State.LOGGED_OUT;
}
/**
* Change the state to SELECTED by opening the "All Mail" mailbox in
* readonly mode. Returns a reference to this object.
* @throws IllegalStateException if the beginning state is not AUTHENTICATED.
*/
public SimpleIMAPClient changeToSelectedState() throws IOException {
if ( !state.equals( State.AUTHENTICATED ) )
throw new IllegalStateException();
sendCommand( "3 EXAMINE \"[Gmail]/All Mail\"" );
String input = readLine();
while (!input.startsWith( "3 OK " )) {
input = readLine();
}
this.state = State.SELECTED;
return this;
}
/**
* Change the state to AUTHENTICATED; and then returns a reference to
* this object.
* TODO Note, the account and app-password are hard-coded into this method implementation.
* @throws IllegalStateException if the beginning state is not NOT_AUTHENTICATED.
*/
public SimpleIMAPClient changeToLoggedInState() throws IOException {
if ( !state.equals( State.NOT_AUTHENTICATED ) )
throw new IllegalStateException();
String input = readLine();
while (!input.startsWith( "* OK " )) {
input = readLine();
}
sendCommand( "1 LOGIN [email protected] ***********" );
while (!input.startsWith( "1 OK " )) {
input = readLine();
}
this.state = State.AUTHENTICATED;
return this;
}
/**
* Send the command to the IMAP server.
*/
public void sendCommand( String message ) {
request.print( message );
if ( !message.endsWith( "\r\n" ) )
request.print( "\r\n" );
request.flush();
System.out.println( message );
}
/**
* Reads one line, mapping each byte to one char (effectively ISO-8859-1,
* not the default Charset). A \\n that is not preceded by \\r raises
* IOException. Only the \\r\\n sequence is allowed per RFC 3501#2.2.
*
* @return The line, without its terminator \\r\\n.
*/
public String readLine() throws IOException {
StringBuffer line = new StringBuffer();
int ch = response.read();
char prev = '\0';
while (ch != -1) {
if ( line.length() == 0 ) {
if ( ( char ) ch == '\n' ) {
throw new IOException( "Line begins with \\n" );
}
} else {
if ( prev == '\r' && ( char ) ch == '\n' ) {
line.setLength( line.length() - 1 ); // chop off \r
System.err.println( line.toString() );
return line.toString();
} else if ( ( char ) ch == '\n' ) {
throw new IOException( "Encountered \\n without \\r." );
}
}
prev = ( char ) ch;
line.append( prev );
ch = response.read();
}
throw new IOException( "end of stream." );
}
}
}
In my past experience with the Gimap server, if it has to parse the headers to fulfill your request, it tends to eat illegal characters. You can work around this by fetching the whole message in one go; the server will then likely give you the raw data, unfiltered.
I recommend trying to fetch RFC822 or the whole body section (BODY[]).
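Once the whole message has been fetched, the Subject header can be pulled out on the client without decoding the bytes. The sketch below assumes the raw message is already in a byte array; the class name and the sample message are made up for illustration, and the 0xED byte in the sample is only a guess at what an unencoded "í" might look like, not what this particular email contains.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: extracts the raw Subject value from a full RFC 822 message. */
final class RawSubjectExtractor {

    /** Returns the raw bytes of the Subject header value, or null if there is none. */
    static byte[] subjectBytes(byte[] rawMessage) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost or replaced.
        String text = new String(rawMessage, StandardCharsets.ISO_8859_1);
        int end = text.indexOf("\r\n\r\n"); // a blank line ends the header section
        String headers = end >= 0 ? text.substring(0, end) : text;
        // Unfold continuation lines: drop a CRLF that is followed by a space or tab.
        headers = headers.replaceAll("\r\n(?=[ \t])", "");
        for (String line : headers.split("\r\n")) {
            if (line.regionMatches(true, 0, "Subject:", 0, 8)) {
                return line.substring(8).trim().getBytes(StandardCharsets.ISO_8859_1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Made-up message with an unencoded 0xED byte in the Subject header.
        byte[] raw = "From: a@example.com\r\nSubject: Luc\u00EDa posted\r\n\r\nbody\r\n"
            .getBytes(StandardCharsets.ISO_8859_1);
        StringBuilder hex = new StringBuilder();
        for (byte b : subjectBytes(raw)) {
            hex.append(String.format("%02X ", b));
        }
        System.out.println(hex.toString().trim());
    }
}
```

Working on bytes this way leaves the choice of charset to the client, which can then try UTF-8, fall back to a legacy charset, or handle RFC 2047 encoded-words as appropriate.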