
Printing the content of web page in Java


I'm trying to read the content of https://example.com/ using the HttpURLConnection class. I've removed the HTML tags between angle brackets, but I can't work out how to remove the text between curly braces. Also, there are no spaces between the words that get printed.

Here is the code:

    URL url = new URL("https://example.com/");
    Scanner sc = new Scanner(url.openStream());
    StringBuffer sb = new StringBuffer();
    while(sc.hasNext()) {
        sb.append(sc.next());
    }
    String result = sb.toString();

    //Removing the HTML tags
    result = result.replaceAll("<[^>]*>", " ");
    
    System.out.println("Contents of the web page: "+result);

And this is the output I'm getting:

Contents of the web page: ExampleDomain body{background-color:#f0f0f2;margin:0;padding:0;font-family:-apple-system,system-ui,BlinkMacSystemFont,"SegoeUI","OpenSans","HelveticaNeue",Helvetica,Arial,sans-serif;}div{width:600px;margin:5emauto;padding:2em;background-color:#fdfdff;border-radius:0.5em;box-shadow:2px3px7px2pxrgba(0,0,0,0.02);}a:link,a:visited{color:#38488f;text-decoration:none;}@media(max-width:700px){div{margin:0auto;width:auto;}} ExampleDomain Thisdomainisforuseinillustrativeexamplesindocuments.Youmayusethisdomaininliteraturewithoutpriorcoordinationoraskingforpermission. Moreinformation...

How can I remove the content between the curly braces, and how can I put spaces between the words in the sentences?


Solution

  • To remove the content between curly braces, you can use String#replaceAll(String, String), which is described in the String Javadoc:

    str.replaceAll("\\{.*\\}", "");
    

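    This regex is greedy: it matches everything from the first opening brace to the last closing brace. That works here because the tokens are joined into a single line, so . never has to cross a line terminator. Note that the selector in front of the first brace (body in the output above) is not matched and stays in the result. The missing spaces come from Scanner#next(), which splits on whitespace and discards it, so a space has to be appended before each token. So your code would be: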

    URL url = new URL("https://example.com/");
    Scanner sc = new Scanner(url.openStream());
    StringBuffer sb = new StringBuffer();
    while (sc.hasNext()) {
        sb.append(" " + sc.next());
    }
    String result = sb.toString();
    
    // Removing the HTML tags
    result = result.replaceAll("<[^>]*>", "");
    
    // Removing the CSS rules (greedy: from the first '{' through the last '}')
    result = result.replaceAll("\\{.*\\}", "");
    
    System.out.println("Contents of the web page: " + result);
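
The brace regex above still leaves the first CSS selector in the output and would also delete any braces in the page's visible text. One possible alternative, shown here as a minimal sketch rather than a definitive solution, is to drop whole <style> and <script> elements before stripping the remaining tags. The class name PageText and the helper extractText are hypothetical names introduced for illustration. Regex-based stripping is only an approximation of HTML parsing and is assumed to be good enough for a simple page like this one.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class PageText {

    // Hypothetical helper: reduces an HTML string to (roughly) its visible text.
    static String extractText(String html) {
        return html
                // Drop <style> and <script> elements together with their contents.
                // (?i) ignores case, (?s) lets '.' cross line breaks, '.*?' is lazy.
                .replaceAll("(?is)<(style|script)\\b[^>]*>.*?</\\1\\s*>", " ")
                // Replace every remaining tag with a space so neighbouring words stay apart.
                .replaceAll("<[^>]*>", " ")
                // Collapse runs of whitespace into single spaces.
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("https://example.com/");
        String html;
        // Assumption: the page is UTF-8 encoded.
        try (InputStream in = url.openStream();
             Scanner sc = new Scanner(in, StandardCharsets.UTF_8.name())) {
            // "\\A" matches only at the start of input, so the first token is the
            // whole stream and the original whitespace between words is preserved.
            sc.useDelimiter("\\A");
            html = sc.hasNext() ? sc.next() : "";
        }
        System.out.println("Contents of the web page: " + extractText(html));
    }
}
```

Reading the stream as a single token keeps the page's own spacing, so no space has to be re-inserted between words. For anything beyond simple pages, a real HTML parser is likely a more robust choice than regular expressions.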