Java program to download images from a website and display the file sizes


I'm creating a Java program that will read an HTML document from a URL and display the sizes of the images it references. I'm not sure how to go about this, though.

I don't need to actually download and save the images; I just need their sizes and the order in which they appear on the web page.

For example, a web page has 3 images:

<img src="dog.jpg" /> //which is 54kb
<img src="cat.jpg" /> //which is 75kb
<img src="horse.jpg"/> //which is 80kb

I would need the output of my Java program to be:

54kb
75kb
80kb

Any ideas where I should start?

P.S. I'm a bit of a Java newbie.


Solution

  • If you're new to Java, you may want to leverage an existing library to make things a bit easier. Jsoup lets you fetch an HTML page and extract elements using CSS-style selectors.

    This is just a quick and very dirty example, but I think it shows how easy Jsoup can make such a task. Please note that error handling and response-code handling were omitted; I merely wanted to pass on the general idea:

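    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse the page.
    Document doc = Jsoup.connect("http://stackoverflow.com/questions/14541740/java-program-to-download-images-from-a-website-and-display-the-file-sizes").get();

    // Select every <img> that has a src attribute, in document order.
    Elements imgElements = doc.select("img[src]");
    // LinkedHashMap keeps insertion order, so the sizes print in page order.
    Map<String, String> fileSizeMap = new LinkedHashMap<String, String>();

    for (Element imgElement : imgElements) {
        // "abs:src" resolves a relative src against the page URL.
        String imgUrlString = imgElement.attr("abs:src");
        URL imgURL = new URL(imgUrlString);
        HttpURLConnection httpConnection = (HttpURLConnection) imgURL.openConnection();
        // HEAD requests only the headers, so the image body is not downloaded.
        httpConnection.setRequestMethod("HEAD");
        String contentLengthString = httpConnection.getHeaderField("Content-Length");
        httpConnection.disconnect();
        if (contentLengthString == null)
            contentLengthString = "Unknown";

        fileSizeMap.put(imgUrlString, contentLengthString);
    }

    for (Map.Entry<String, String> mapEntry : fileSizeMap.entrySet()) {
        String imgFileName = mapEntry.getKey();
        System.out.println(imgFileName + " ---> " + mapEntry.getValue() + " bytes");
    }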
    

    You might also consider looking at Apache HttpClient. I generally find it preferable to the raw URLConnection/HttpURLConnection approach.
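
Since the question asks for output like 54kb rather than a raw byte count, here is a rough sketch of how the Content-Length value could be turned into that form. It uses only the standard library. The class and method names (ImageSizeProbe, toKilobyteLabel, fetchContentLength) are made up for illustration, and the choices of 1 kb = 1024 bytes, rounding to the nearest kb, and 5-second timeouts are my assumptions, not anything the question specifies.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class ImageSizeProbe {

    // Turns a raw Content-Length header value (in bytes) into a "54kb"-style label.
    // Returns "Unknown" when the header is missing, negative, or not a number.
    // Assumption: 1 kb = 1024 bytes, rounded to the nearest whole kb.
    public static String toKilobyteLabel(String contentLength) {
        if (contentLength == null) {
            return "Unknown";
        }
        try {
            long bytes = Long.parseLong(contentLength.trim());
            if (bytes < 0) {
                return "Unknown";
            }
            return Math.round(bytes / 1024.0) + "kb";
        } catch (NumberFormatException e) {
            return "Unknown";
        }
    }

    // Sends a HEAD request so that only the headers are transferred.
    // Returns the Content-Length header, or null if the server does not
    // answer 200 OK or does not report a length.
    public static String fetchContentLength(String imageUrl) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(imageUrl).toURL().openConnection();
        try {
            connection.setRequestMethod("HEAD");
            connection.setConnectTimeout(5000); // arbitrary 5-second timeouts
            connection.setReadTimeout(5000);
            if (connection.getResponseCode() != HttpURLConnection.HTTP_OK) {
                return null;
            }
            return connection.getHeaderField("Content-Length");
        } finally {
            connection.disconnect();
        }
    }

    // Prints one size label per image URL passed on the command line,
    // in the order the URLs were given.
    public static void main(String[] args) throws IOException {
        for (String imageUrl : args) {
            System.out.println(toKilobyteLabel(fetchContentLength(imageUrl)));
        }
    }
}
```

You could call toKilobyteLabel on each value before printing it in the loop above. Keep in mind that some servers omit Content-Length (for example with chunked responses) or reject HEAD requests, in which case this prints Unknown.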