Detect protocol from Java SOCKS socket

I'm developing a custom SOCKS5 server in Java. Other than the first CONNECT message that includes the HOST and PORT, is there any way to inspect the subsequent messages to determine the protocol of the data? For example, if the application data starts with "GET /...", the request is likely HyperText Transfer Protocol (HTTP), but that is far from a complete solution. Is there a way to see if the data is, say, HTTPS, or FTP, or "NetFlix streaming", etc.?

Secondarily, if the data is HTTP or HTTPS, how would I forward the request to a dedicated HTTP proxy?

Solution

is there any way to inspect the subsequent messages to determine the protocol of the data? ... Is there a way to see if the data is, say, HTTPS, or FTP, or "NetFlix streaming", etc.?

Basically, you have the destination port, the destination IP address, maybe the hostname (if DNS resolution is done through the SOCKS5 server too), and the payload. Based on knowledge of well-known target hosts, target ports and typical payloads, you could build heuristics to guess the protocol.
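As a rough illustration of such a heuristic, here is a minimal sketch that looks at the first bytes the client sent and falls back to the destination port. The class name ProtocolGuesser, the Protocol enum and the guess method are made up for this example and are not part of any library. The port table and the list of HTTP methods are deliberately incomplete, and a match on TLS only says "this looks like a TLS ClientHello", not that HTTP is inside it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not a complete classifier.
final class ProtocolGuesser {

    enum Protocol { HTTP, TLS, SSH, FTP, SMTP, UNKNOWN }

    // Common HTTP/1.x request methods, each followed by the mandatory space.
    private static final String[] HTTP_METHODS = {
        "GET ", "HEAD ", "POST ", "PUT ", "DELETE ", "OPTIONS ", "PATCH ", "TRACE ", "CONNECT "
    };

    /**
     * Guesses the protocol from the first len bytes sent by the client
     * (len may be 0 if the client has not sent anything yet) and the
     * destination port from the SOCKS5 request. Assumes len <= head.length.
     */
    static Protocol guess(int port, byte[] head, int len) {
        // TLS record header: content type 0x16 (handshake), major version 0x03,
        // two version/length bytes each, then handshake type 0x01 (ClientHello).
        if (len >= 6
                && (head[0] & 0xFF) == 0x16
                && (head[1] & 0xFF) == 0x03
                && (head[5] & 0xFF) == 0x01) {
            return Protocol.TLS;
        }

        // Text-based protocols: compare the first few bytes as ASCII.
        String text = new String(head, 0, Math.min(len, 16), StandardCharsets.US_ASCII);
        for (String method : HTTP_METHODS) {
            if (text.startsWith(method)) {
                return Protocol.HTTP;
            }
        }
        if (text.startsWith("SSH-")) {
            return Protocol.SSH;
        }

        // Fallback: well-known ports. This is the only signal available for
        // protocols where the server speaks first (FTP, SMTP).
        switch (port) {
            case 80:  return Protocol.HTTP;
            case 443: return Protocol.TLS;
            case 22:  return Protocol.SSH;
            case 21:  return Protocol.FTP;
            case 25:  return Protocol.SMTP;
            default:  return Protocol.UNKNOWN;
        }
    }
}
```

In a SOCKS5 server you would typically read the first chunk from the client socket into a buffer, call guess(port, buffer, bytesRead), and then replay that buffer to whatever upstream you pick, since the bytes have already been consumed.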

You will find such heuristics in today's intrusion detection systems, more capable firewalls, and traffic classifiers. They differ a lot in detection quality, and a determined user can often fool them. This is a very wide topic, but you might start by looking at free deep packet inspection (DPI) libraries like nDPI and by reading more about DPI on Wikipedia.

Secondarily, if the data is HTTP or HTTPS, how would I forward the request to a dedicated HTTP proxy?

First, change the connection target from the one requested by the client to the proxy. This must of course be done before any data gets transferred, which may conflict with the DPI you do on the data stream: in some protocols (like SMTP) the server sends data first, while in others, like HTTP(S), the client sends first. Thus you probably need to decide whether this is HTTP(S) before seeing any payload, i.e. based only on the target port. For HTTPS you would then need to establish a tunnel through the proxy using a CONNECT request, as described in RFC 2817. For HTTP you would modify the request line to include the full URL rather than only the path (i.e. http://host[:port]/path).
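The two rewrites described above can be sketched roughly as follows. The class HttpProxyForwarding and its method names are invented for this illustration. It assumes HTTP/1.1, a host name or IPv4 address (an IPv6 literal would need brackets), and a proxy that requires no authentication.

```java
// Hypothetical helper for illustration only.
final class HttpProxyForwarding {

    /** Builds the CONNECT request to send to the HTTP proxy for an HTTPS target. */
    static String connectRequest(String host, int port) {
        String authority = host + ":" + port;
        return "CONNECT " + authority + " HTTP/1.1\r\n"
             + "Host: " + authority + "\r\n"
             + "\r\n";
    }

    /** Returns true if the proxy's status line reports success (any 2xx code). */
    static boolean tunnelEstablished(String statusLine) {
        String[] parts = statusLine.split(" ", 3);
        return parts.length >= 2
            && parts[1].length() == 3
            && parts[1].charAt(0) == '2';
    }

    /**
     * Rewrites an origin-form request line ("GET /path HTTP/1.1") into the
     * absolute form a proxy expects ("GET http://host[:port]/path HTTP/1.1").
     * Lines that do not look like origin-form requests are returned unchanged.
     */
    static String toAbsoluteForm(String requestLine, String host, int port) {
        String[] parts = requestLine.split(" ", 3);
        if (parts.length != 3 || !parts[1].startsWith("/")) {
            return requestLine;
        }
        // The default HTTP port is left out of the URL.
        String authority = (port == 80) ? host : host + ":" + port;
        return parts[0] + " http://" + authority + parts[1] + " " + parts[2];
    }
}
```

For HTTPS, you would open a socket to the proxy, write connectRequest(host, port), read the response headers up to the empty line, and, if tunnelEstablished returns true for the status line, start relaying bytes in both directions unmodified. For plain HTTP, only the first line of each request needs the toAbsoluteForm rewrite; with keep-alive connections that means tracking request boundaries, which this sketch does not do.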

As you can see, all of this relies on heuristics that work for most but not all cases. Apart from that, this can be a very complex task, depending on the quality of traffic classification you need.