
What is InputStreamReader really? and URL?


My code works, but I took it from a tutorial on how to read a web page. Before this I had only lightly dabbled in file I/O, and nothing like this. I don't really know what's going on with InputStreamReader and the URL identifier. Here is my code:

import java.net.*;
import java.util.*;
import java.io.*;

public class urlReader {

    public static void main(String[] args) {
        URL[] websites = new URL[4];

        URLConnection conn = null;
        // The try/catch handles malformed URLs and I/O exceptions
        try {
            // The websites to read, stored in an array
            websites[0] = new URL("https://www.pravdareport.com");
            websites[1] = new URL("https://pravda.ru");
            websites[2] = new URL("https://www.lefigaro.fr");
            websites[3] = new URL("https://www.independent.co.uk");
            
            // Each pass of the loop handles one website from the array
            for (int i = 0; i < websites.length; i++) {
                URL webpage = websites[i];
                conn = webpage.openConnection(); // Creates the connection object for this website
                // Decodes the bytes the server sends into characters, using UTF-8
                InputStreamReader reader = new InputStreamReader(conn.getInputStream(), "UTF8");
                BufferedReader br = new BufferedReader(reader);
                String lines = "";
                while ((lines = br.readLine()) != null) {
                    // Only matches when the opening and closing tags are on the same line
                    if (lines.indexOf("<title>") != -1 && lines.indexOf("</title>") != -1) {
                        String title1 = lines.substring(lines.indexOf("<title>") + 7, lines.indexOf("</title>"));
                        System.out.println("The Title of " + webpage.getHost() + " is: " + title1);
                        break;
                    }
                }
                br.close(); // Releases the stream before moving on to the next website
            }
            
        } catch (MalformedURLException e) {
            e.printStackTrace();
            
        } catch (IOException e) {
            e.printStackTrace();
        }
        

    }

}

Solution

    URLConnection

    A highly abstract thing: it represents a connection to a URL. It doesn't have to be active (there is a separate .connect() method, and calls like .getInputStream() implicitly connect if the connection isn't open yet), and it can be to anything - a file, a resource in a jar, a BLOB in an SQL-based database, or a connection to a web server, which is what is happening here.

    Such connections can be two-way (you can send data to them as well as receive data from them) but don't have to be. Web connections are one-way, unless it's a POST call, such as when you submit a web form. Then it's two-way: you send the form data, which might be a lot if it includes a file upload element, and then you receive the response.
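    To make the "it can be to anything" point concrete, here is a minimal sketch that reads a local file through a URLConnection, with no web server involved. The class name UrlConnectionDemo and the helpers fetch and roundTrip are made up for this illustration, and it assumes Java 11 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class UrlConnectionDemo {

    // Reads everything the URL points at and returns it as UTF-8 text.
    // The same code path works for file:, jar:, http: and https: URLs.
    static String fetch(URL url) {
        try {
            URLConnection conn = url.openConnection();     // creates the object, does not connect yet
            try (InputStream in = conn.getInputStream()) { // connects implicitly
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes text to a temp file, then reads it back via a file: URL.
    static String roundTrip(String text) {
        try {
            Path tmp = Files.createTempFile("demo", ".txt");
            try {
                Files.writeString(tmp, text);
                return fetch(tmp.toUri().toURL());
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("hello from a file: URL"));
    }
}
```

    Passing one of the https URLs from the question to fetch should go through the same two calls; only the protocol handler behind them differs.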

    InputStream

    As produced by conn.getInputStream(). An InputStream is a thing that can produce a stream of bytes. It may be endless (for example, a stream that produces random numbers, as many as you want, or a web connection, which could be endless if the server decides to serve bytes as long as you continue to ask for them), but more usually it will end. You don't know when: another property of InputStreams is that the source providing the bytes doesn't have to tell you in advance whether the stream will ever end, or when.

    conn.getInputStream() represents the data that the server sends. It's just the data - the 'raw data' sent directly by a web server includes headers and the like. conn.getInputStream() represents only the 'payload' - for example, when connecting to https://www.pravda.ru/ (if for some reason you want to do that), conn.getInputStream() returns an InputStream that gives you the HTML.
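    The "you don't know when it ends" property shows up directly in how you read a stream: you keep calling read() until it returns -1. A small sketch, using an in-memory ByteArrayInputStream as a stand-in for the network stream (InputStreamDemo and countBytes are names invented for this example):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class InputStreamDemo {

    // Counts bytes by reading until read() returns -1.
    // That return value is the only end-of-stream signal an InputStream gives.
    static long countBytes(InputStream in) {
        try {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for conn.getInputStream(): three bytes held in memory.
        InputStream in = new ByteArrayInputStream(new byte[] {72, 105, 33});
        System.out.println(countBytes(in)); // prints 3
    }
}
```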

    InputStreamReader

    InputStream provides bytes, and bytes are not characters. If you're receiving ASCII, there is virtually no difference between bytes and characters, but it's 2023 and there are way more characters than that. Most websites, for example, are sent as UTF-8, where 1 to 4 bytes represent a character. Reader is to InputStream as 'character' is to 'byte': where an InputStream gives you bytes until the stream is done, a Reader gives you chars until it is done.

    An InputStreamReader bridges them: you give it an InputStream and a charset encoding, and it gives you a Reader. This code uses a non-standard way to say "it is UTF-8". StandardCharsets.UTF_8 is the standard way; the string "UTF8" is less standard (if you typo "UTF8", you won't know until you run it, whereas if you typo StandardCharsets.UTF_8 you know as you type, because your editor will immediately red-wavy-underline it).
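    The byte/character difference is easy to see with a single non-ASCII letter. The sketch below decodes UTF-8 bytes through an InputStreamReader; DecodeDemo and decode are names made up for this example, and the byte array stands in for whatever the server would send.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Turns UTF-8 bytes into text by reading one char at a time from a Reader.
    static String decode(byte[] bytes) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) { // -1 means the reader is done
                sb.append((char) c);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is "hello" with an e-acute, which takes 2 bytes in UTF-8.
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(bytes.length);           // prints 6
        System.out.println(decode(bytes).length()); // prints 5
    }
}
```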

    BufferedReader

    One 'issue' that connections can have is peculiar chunking behaviour. For example, just about any disk system (be it a spinning disk or a bunch of SSD cells) cannot give you single bytes. It can only give you fairly large chunks of them. This is a problem: if your code processes things one at a time, then an InputStream has to be allowed to be really wasteful (read an entire disk sector in, and then toss it all in the garbage except the one byte you asked for), because the alternative requires that the object allocate a bunch of RAM to store that stuff. Usually you'd want it to do that, but every so often you don't. Of course, if you 'wrap' that InputStream into a Reader with InputStreamReader, you run into the same problem.

    BufferedReader lets you opt in: yes, please allocate memory for efficiency. A BufferedReader wraps around a Reader and changes almost nothing (it is also a Reader and has the same methods, plus the convenience method readLine() that your loop uses). The difference is in its spec: if you call read() (which returns a single character), under the hood BufferedReader actually grabs a whole boatload of them, gives you the first one, and stores the rest in an internal cache. It will continue to use that cache as you call read() until it has run dry, at which point it'll ask the underlying InputStreamReader for a bunch more. That will in turn ask its underlying InputStream (which you got from conn.getInputStream()) for a bunch of bytes.
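    As a rough sketch of how the wrapping looks in practice, the example below runs the same title search as the question, but against a StringReader so it needs no network. BufferedReaderDemo and findTitle are hypothetical names, and like the original loop it only finds a title whose opening and closing tags sit on the same line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferedReaderDemo {

    // Returns the text between <title> and </title> if both appear on one line,
    // or null if no such line exists.
    static String findTitle(Reader source) {
        try (BufferedReader br = new BufferedReader(source)) {
            String line;
            while ((line = br.readLine()) != null) { // null means end of input
                int start = line.indexOf("<title>");
                int end = line.indexOf("</title>");
                if (start != -1 && end > start) {
                    return line.substring(start + "<title>".length(), end);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the InputStreamReader built from conn.getInputStream().
        String html = "<html>\n<head><title>Example page</title></head>\n</html>";
        System.out.println(findTitle(new StringReader(html))); // prints Example page
    }
}
```

    Closing the BufferedReader (here via try-with-resources) also closes the readers and streams it wraps.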

    URL

    A URL is a "Uniform Resource Locator". It's a spec for writing a structured line of text that identifies a resource and how to load it. The web uses URLs - it's what you can type into a browser to tell it where to connect to and what to get from there. https://www.independent.co.uk/europe is a URL that represents:

    • https - use the https protocol to connect to a server to fetch this resource.
    • www.independent.co.uk - ask the DNS server for the IP address that goes with www.independent.co.uk, then connect to that IP. (Where is the DNS server? The system knows, because when it connected to the network, the router in your closet or whatnot told it.)
    • /europe - and ask the server that answers there for this resource.

    URLs don't have to describe web pages. file:///foo/bar/baz is a URL. So is ftp://ftp.foobar.com/path/to/some/file. So is my-bank-app://whatever, which is how iPhones and Android let websites open apps. It's a very flexible system. It's essentially someprotocol://someserver/someresourceidentifier, where all three parts can be just about anything.
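    The java.net.URL class exposes those three parts through getters, which makes the breakdown above easy to check. A minimal sketch (UrlPartsDemo and describe are names invented for this example; nothing is fetched, the text is only parsed):

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

class UrlPartsDemo {

    // Splits a URL into protocol, host and path, joined with " | ".
    static String describe(String text) {
        try {
            URL url = URI.create(text).toURL();
            return url.getProtocol() + " | " + url.getHost() + " | " + url.getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a URL this JVM can handle: " + text, e);
        }
    }

    public static void main(String[] args) {
        // prints https | www.independent.co.uk | /europe
        System.out.println(describe("https://www.independent.co.uk/europe"));
        // A file: URL has no server part, so the host comes back empty.
        System.out.println(describe("file:///foo/bar/baz"));
    }
}
```

    One caveat: java.net.URL only accepts protocols the JVM has a handler for, so something like my-bank-app://whatever would normally be rejected with a MalformedURLException, even though it is a perfectly good URL as far as the spec goes.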