I have a Java application which opens an existing company's website using the Socket class:
Socket sockSite;
InputStream inFile = null;
BufferedWriter out = null;
try
{
sockSite = new Socket( presetSite, 80 );
inFile = sockSite.getInputStream();
out = new BufferedWriter( new OutputStreamWriter(sockSite.getOutputStream()) );
}
catch ( IOException e )
{
...
}
out.write( "GET " + presetPath + " HTTP/1.1\r\n\r\n" );
out.flush();
I would read the website with the stream inFile and life is good.
Recently this started to fail. I was getting an HTTP 301 "site has moved" error but no moved-to link. The site still exists and responds to the same original URL in any web browser. But the above code comes back with the HTTP 301.
I changed the code to this:
URL url;
InputStream inFile = null;
try
{
url = new URL( presetSite + presetPath );
inFile = url.openStream();
}
catch ( IOException e )
{
...
}
I then read the site from the inFile stream with the original code, and it now works again.
This difference doesn't just occur in Java; it also occurs in Perl. The IO::Socket::INET approach (opening the website on port 80, then issuing a GET) fails, but the LWP::Simple method get just works. In other words, I get a failure if I open the web page first with port 80, then do a GET, but it works fine if I use a class which does it "all at once" (that just says, "get me web page with such-and-such an HTTP address").
I thought I'd try the different approaches on http://www.microsoft.com and got an interesting result. When I opened port 80 and then issued GET / on www.microsoft.com, I received an HTTP 200 response with a page that said, "Your current user agent appears to be from an automated process...". But if I use the second method (URL class in Java, or LWP in Perl) I simply get their web page.
So my question is: how does the URL class (in Java) or the LWP module (in Perl) do its thing under the hood that makes it different from opening the website on port 80 and issuing a GET?
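Most servers require the Host: header, to allow virtual hosting (multiple domains on one IP). HTTP/1.1 makes that header mandatory, and the request in the question omits it. The URL class and LWP add it for you, along with other headers such as User-Agent.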
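As a rough sketch of what that means for the raw-socket version, the request could be built with a Host: header like this. The class name RawHttpGet, the helper names, the User-Agent string, and the default host example.com are all made up for illustration; the Connection: close header is my addition so the server ends the response by closing the socket. This is not what the URL class literally does internally, only a minimal request that most servers will accept.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpGet {
    // Builds a minimal HTTP/1.1 GET request.
    // Host is required by HTTP/1.1; the User-Agent value is a placeholder.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: RawHttpGet-example/0.1\r\n"
             + "Connection: close\r\n"
             + "\r\n"; // blank line ends the header section
    }

    // Opens port 80 on the host, sends the request, and returns the raw
    // response (status line, headers, and body) as bytes.
    static byte[] fetch(String host, String path) throws IOException {
        try (Socket sock = new Socket(host, 80)) {
            BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(sock.getOutputStream(), StandardCharsets.US_ASCII));
            out.write(buildRequest(host, path));
            out.flush();
            InputStream in = sock.getInputStream();
            return in.readAllBytes(); // requires Java 9 or later
        }
    }

    public static void main(String[] args) throws IOException {
        // Host and path come from the command line; the defaults are placeholders.
        String host = args.length > 0 ? args[0] : "example.com";
        String path = args.length > 1 ? args[1] : "/";
        System.out.println(new String(fetch(host, path), StandardCharsets.ISO_8859_1));
    }
}
```

Note that the host passed here is a bare host name (no http:// prefix), the same form the Socket constructor expects.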