
How to get the full/absolute link from an HREF attribute using javax.swing.text.html?


I'm trying to collect the links on a website and put them in a List, but I keep getting incomplete links without the site root. For example, I get something like /thing.html/ instead of http://website.com/thing.html/

It is meant to be a search engine, so I need to parse the linked pages too, and I need the full links in order to do that.

I am also not allowed to use any third-party library such as jsoup, which is why I'm using javax.swing.text.html.

I think you can do something like anchor.attr("abs:href") using jsoup; that's essentially what I need here.

Here is the code I have so far:

import java.util.List;
import java.util.ArrayList;
import java.net.*;
import java.io.*;

import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.html.HTML.Attribute;
import javax.swing.text.MutableAttributeSet; 

public class PARSER {

    public static List<String> getLinks(BufferedReader BuffRead) throws IOException {
        final ArrayList<String> list = new ArrayList<>();

        ParserDelegator parserDelegator = new ParserDelegator();
        ParserCallback parserCallback = new ParserCallback() {
            public void handleText(final char[] data, final int pos) { }
            public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) {
                if (tag == Tag.A) {
                    String address = (String) attribute.getAttribute(Attribute.HREF);
                    // This is where I get the HREF "links"
                    list.add(address);
                }
            }
            public void handleEndTag(Tag t, final int pos) { }
            public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
            public void handleComment(final char[] data, final int pos) { }
            public void handleError(final java.lang.String errMsg, final int pos) { }
        };
        parserDelegator.parse(BuffRead, parserCallback, false);
        return list;
    }
}

Solution

  • First: consider not writing your class names in all caps. Parser or MyParser, with just a leading capital, suffices ;)

    If you are only crawling one website, you will probably find mostly relative links. It is common to use them for internal pages, and for relative links the results you are getting are correct. Do you know whether there are external links on the website you are parsing?

    I don't know in what environment you call your parser, but if you just call Parser.getLinks(someBuffer) without any knowledge of the website you're parsing, you are left with the links exactly as they appear in the HTML. If you are parsing online sites, you could just add the base URL. Since you know which site you are currently on, you could pass its URL in and prepend it to each relative link:

    The method signature would look like this:

    public static List<String> getLinks(BufferedReader BuffRead, String baseUrl) throws IOException 
    

    And you would check for relative links with something like this (this is very simple):

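    if (tag == Tag.A) {
      String address = (String) attribute.getAttribute(Attribute.HREF);
      // Anchors without an href (e.g. <a name="...">) yield null, so skip them
      if (address != null) {
        // if (!address.startsWith("http")) should work too as a primitive check,
        // since an absolute link often starts with "http" as the protocol
        if (address.startsWith("/") || address.startsWith("..")) {
          // Plain concatenation: assumes baseUrl has no trailing slash,
          // e.g. http://website.com; ".." links are not normalized here
          address = baseUrl + address;
        }
        list.add(address);
      }
    }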
    

    Greetings
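
As a more general alternative to checking prefixes by hand, the standard library's java.net.URI can resolve an href against the URL of the page it was found on, which is roughly what jsoup's abs:href does. The following is a minimal sketch rather than a definitive implementation: the class name LinkExtractor, the Reader parameter, and the choice to silently skip malformed hrefs are assumptions made for illustration, and baseUrl is assumed to be the full URL of the page being parsed (with at least a trailing slash after the host).

```java
import java.io.IOException;
import java.io.Reader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class; the name is illustrative, not from the original post.
class LinkExtractor {

    // Collects the href of every <a> tag and resolves it against baseUrl.
    static List<String> getLinks(Reader reader, String baseUrl) throws IOException {
        final URI base = URI.create(baseUrl);
        final List<String> links = new ArrayList<>();

        // ParserCallback has empty default methods, so only the one we need is overridden.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.A) {
                    return;
                }
                Object href = attrs.getAttribute(HTML.Attribute.HREF);
                if (href == null) {
                    return; // e.g. <a name="..."> has no href
                }
                try {
                    // Relative references are resolved against the base;
                    // absolute links come back unchanged.
                    links.add(base.resolve(href.toString().trim()).toString());
                } catch (IllegalArgumentException e) {
                    // Assumption: hrefs that are not valid URIs are simply skipped.
                }
            }
        };

        // Third argument true: ignore charset declarations inside the document.
        new ParserDelegator().parse(reader, callback, true);
        return links;
    }
}
```

For example, calling LinkExtractor.getLinks(reader, "http://website.com/dir/page.html") on a page that links to /thing.html and ../up.html would yield http://website.com/thing.html and http://website.com/up.html, while links that are already absolute pass through unchanged.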