I have a nightly job that is using net.schmizz.sshj
SFTP library.
I suspect that there was a change on the server that we SFTP into. However, I cannot pinpoint what the exact change is. Nor can I reproduce it outside the production server. So I am looking for insight of a root cause or how to reproduce the following error
net.schmizz.sshj.transport.TransportException: Timeout expired
at net.schmizz.sshj.transport.TransportException$1.chain(TransportException.java:33)
at net.schmizz.sshj.transport.TransportException$1.chain(TransportException.java:27)
at net.schmizz.concurrent.Promise.retrieve(Promise.java:139)
at net.schmizz.concurrent.Event.await(Event.java:103)
at net.schmizz.sshj.transport.KeyExchanger.waitForDone(KeyExchanger.java:160)
at net.schmizz.sshj.transport.KeyExchanger.startKex(KeyExchanger.java:155)
at net.schmizz.sshj.transport.TransportImpl.doKex(TransportImpl.java:247)
at net.schmizz.sshj.SSHClient.doKex(SSHClient.java:735)
at net.schmizz.sshj.SSHClient.onConnect(SSHClient.java:723)
at net.schmizz.sshj.SocketClient.connect(SocketClient.java:54)
at net.schmizz.sshj.SocketClient.connect(SocketClient.java:80)
From what I can tell, we're connecting via a java.net.Socket
ok and then timing out waiting for the key exchange to complete in this code
package net.schmizz.sshj.transport;
/** Algorithm negotiation and key exchange. */
final class KeyExchanger
implements SSHPacketHandler, ErrorNotifiable {
/**
* Starts key exchange by sending a {@code SSH_MSG_KEXINIT} packet. Key exchange needs to be done once mandatorily
* after initializing the {@link Transport} for it to be usable and may be initiated at any later point e.g. if
* {@link Transport#getConfig() algorithms} have changed and should be renegotiated.
*
* @param waitForDone whether should block till key exchange completed
*
* @throws TransportException if there is an error during key exchange
* @see {@link Transport#setTimeoutMs} for setting timeout for kex
*/
void startKex(boolean waitForDone)
throws TransportException {
if (!kexOngoing.getAndSet(true)) {
done.clear();
sendKexInit();
}
if (waitForDone)
waitForDone();
}
/**
* Sends SSH_MSG_KEXINIT and sets the {@link #kexInitSent} event.
*
* @throws TransportException
*/
private void sendKexInit()
throws TransportException {
log.debug("Sending SSH_MSG_KEXINIT");
clientProposal = new Proposal(transport.getConfig());
transport.write(clientProposal.getPacket());
kexInitSent.set();
}
void waitForDone()
throws TransportException {
done.await(transport.getTimeoutMs(), TimeUnit.MILLISECONDS);
}
After much debugging after posting this question I figured out that the problem was that we don't have the same encryption libraries running in the production environments as I did in my local testing. I determined this by building the minimal jar that I could while still retaining all the SSHJ pieces used in production and all the relevant code of our own and then running it with DEBUG level logging in a command shell where I could see all output. Then I saw the following and I realized that the timeout exception in our production server logs is a second exception occurring after and masking the root cause which is not being logged in production: a NoClassDefFoundError
Exception in thread "reader" java.lang.NoClassDefFoundError: org/bouncycastle/crypto/ec/CustomNamedCurves
at net.schmizz.sshj.transport.kex.Curve25519DH.getCurve25519Params(Curve25519DH.java:51)
at net.schmizz.sshj.transport.kex.Curve25519SHA256.initDH(Curve25519SHA256.java:36)
at net.schmizz.sshj.transport.kex.AbstractDHG.init(AbstractDHG.java:55)
at net.schmizz.sshj.transport.KeyExchanger.gotKexInit(KeyExchanger.java:239)
at net.schmizz.sshj.transport.KeyExchanger.handle(KeyExchanger.java:359)
at net.schmizz.sshj.transport.TransportImpl.handle(TransportImpl.java:493)
at net.schmizz.sshj.transport.Decoder.decode(Decoder.java:107)
at net.schmizz.sshj.transport.Decoder.received(Decoder.java:175)
at net.schmizz.sshj.transport.Reader.run(Reader.java:60)
Caused by: java.lang.ClassNotFoundException: org.bouncycastle.crypto.ec.CustomNamedCurves
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more
net.schmizz.sshj.transport.TransportException: Timeout expired
at net.schmizz.sshj.transport.TransportException$1.chain(TransportException.java:33)
at net.schmizz.sshj.transport.TransportException$1.chain(TransportException.java:27)
at net.schmizz.concurrent.Promise.retrieve(Promise.java:139)
at net.schmizz.concurrent.Event.await(Event.java:103)
at net.schmizz.sshj.transport.KeyExchanger.waitForDone(KeyExchanger.java:160)
at net.schmizz.sshj.transport.KeyExchanger.startKex(KeyExchanger.java:155)
at net.schmizz.sshj.transport.TransportImpl.doKex(TransportImpl.java:247)
at net.schmizz.sshj.SSHClient.doKex(SSHClient.java:735)
at net.schmizz.sshj.SSHClient.onConnect(SSHClient.java:723)
at net.schmizz.sshj.SocketClient.connect(SocketClient.java:54)
at net.schmizz.sshj.SocketClient.connect(SocketClient.java:80)
at kirby.App.connect(App.java:101)
at kirby.App.execute(App.java:45)
at kirby.App.main(App.java:31)
Caused by: java.util.concurrent.TimeoutException: Timeout expired
... 12 more
The difference comes down to classpath. In production we had
bcpg-jdk15on-150.jar:bcpkix-jdk15on-150.jar:bcprov-jdk15on-150.jar
but need
bcpkix-jdk15on/1.51/bcpkix-jdk15on-1.51.jar:ecc-25519-java-1.0.1.jar:bcprov-jdk15on/1.51/bcprov-jdk15on-1.51.jar