Tags: java, http, groovy, htmlcleaner

Using HttpURLConnection to get a page title returns "Moved Permanently"


This is the code I've written in Groovy to get the page title from a URL. However, for some websites I get "Moved Permanently" as the title, which I think is caused by a 301 redirect. How do I avoid this and make HttpURLConnection follow the redirect to the right URL so that I get the correct page title?

For example, for this page I get "Moved Permanently" instead of the correct page title: http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html


        def con = (HttpURLConnection) new URL(url).openConnection()
        con.connect()

        def inputStream = con.inputStream

        HtmlCleaner cleaner = new HtmlCleaner()
        CleanerProperties props = cleaner.getProperties()

        TagNode node = cleaner.clean(inputStream)
        TagNode titleNode = node.findElementByName("title", true);

        def title = titleNode.getText().toString()
        title = StringEscapeUtils.unescapeHtml(title).trim()
        title = title.replace("\n", "");
        return title


Solution

  • I can get this to work if I manage the redirecting myself...

    I think the issue is that the site expects to get back cookies that it sets halfway down the redirect chain, and if it doesn't get them, it sends you to a log-in page.

    This code obviously needs some cleaning up (and there is probably a better way to do this), but it shows how I can extract the title:

    @Grab( 'net.sourceforge.htmlcleaner:htmlcleaner:2.2' )
    @Grab( 'commons-lang:commons-lang:2.6' )
    import org.apache.commons.lang.StringEscapeUtils
    import org.htmlcleaner.*
    
    String location = 'http://www.nytimes.com/2011/08/14/arts/music/jay-z-and-kanye-wests-watch-the-throne.html'
    String cookie = null
    String pageContent = ''
    
    while( location ) {
      new URL( location ).openConnection().with { con ->
        // We'll do redirects ourselves
        con.instanceFollowRedirects = false
    
        // If we got a cookie last time round, then add it to our request
        if( cookie ) con.setRequestProperty( 'Cookie', cookie )
        con.connect()
    
        // Get the response code, and the location to jump to (in case of a redirect)
        int responseCode = con.responseCode
        location = con.getHeaderField( "Location" )
    
        // Try and get a cookie the site will set, we will pass this next time round
        // (keep the previous cookie if this response does not set a new one)
        cookie = con.getHeaderField( "Set-Cookie" ) ?: cookie
    
        // Read the HTML and close the inputstream
        pageContent = con.inputStream.withReader { it.text }
      }
    }
    
    // Then, clean pageContent and get the title
    HtmlCleaner cleaner = new HtmlCleaner()
    CleanerProperties props = cleaner.getProperties()
    
    TagNode node = cleaner.clean( pageContent )
    TagNode titleNode = node.findElementByName("title", true);
    
    def title = titleNode.text.toString()
    title = StringEscapeUtils.unescapeHtml( title ).trim()
    title = title.replace( "\n", "" )
    
    println title
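
    As part of the cleaning up mentioned above, two details of the redirect loop could be tightened: getHeaderField( "Set-Cookie" ) returns only one header value, attributes such as Path or Expires included, and a Location header may be relative. The helpers below are a rough sketch of how that might be handled. The names cookieHeaderFrom and resolveLocation are made up for this example and are not part of the JDK or HtmlCleaner.

    ```groovy
    // Hypothetical helpers, not part of the original code.

    // Build the value of a Cookie request header from all Set-Cookie response
    // values, keeping only the leading name=value pair of each one and
    // dropping attributes such as Path, Expires or HttpOnly.
    String cookieHeaderFrom( List<String> setCookieValues ) {
      def pairs = ( setCookieValues ?: [] ).collect { it.split( ';' )[ 0 ].trim() }
      pairs.findAll { it.contains( '=' ) }.join( '; ' )
    }

    // Resolve a Location header against the URL that produced it, so that a
    // relative redirect such as "/login" becomes an absolute URL.
    // Returns null when there is no Location header (i.e. no redirect).
    String resolveLocation( String currentUrl, String location ) {
      location ? new URL( new URL( currentUrl ), location ).toString() : null
    }
    ```

    Inside the loop, these could stand in for the two getHeaderField calls, for example location = resolveLocation( location, con.getHeaderField( "Location" ) ). The full list of Set-Cookie values should be available through con.headerFields, although the exact casing of the header name in that map depends on what the server sends.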
    

    Hope it helps!
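
    As for the "better way": one option that may be worth trying is the JDK's built-in cookie handling. Installing a java.net.CookieManager as the default CookieHandler makes HttpURLConnection store and resend cookies by itself, including while it follows redirects. The sketch below assumes that missing cookies really are the cause, and fetchPage is a made-up helper name.

    ```groovy
    import java.net.CookieHandler
    import java.net.CookieManager
    import java.net.CookiePolicy

    // Install a JVM-wide, in-memory cookie store. HttpURLConnection consults
    // it for every request, including the ones issued while following redirects.
    CookieHandler.setDefault( new CookieManager( null, CookiePolicy.ACCEPT_ALL ) )

    // Hypothetical helper: fetch a page and let the JDK follow the redirects.
    String fetchPage( String address ) {
      def con = (HttpURLConnection) new URL( address ).openConnection()
      con.instanceFollowRedirects = true   // already the default, stated for clarity
      con.inputStream.withReader { it.text }
    }

    // Usage (performs a network request), with location being the URL string
    // from the script above:
    // String pageContent = fetchPage( location )
    ```

    Note that HttpURLConnection does not automatically follow a redirect that switches protocol (for example from http to https), so in that case handling the redirects manually is still needed.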