web-crawler, html-parsing, crawler4j

Crawling and extracting info using crawler4j


I need help figuring out how to crawl this page: http://www.marinetraffic.com/en/ais/index/ports/all. I want to visit each port, extract its name and coordinates, and write them to a file. The main class looks as follows:

import java.io.FileWriter;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;


public class WorldPortSourceCrawler {

    public static void main(String[] args) throws Exception {
         String crawlStorageFolder = "data";
         int numberOfCrawlers = 5;

         CrawlConfig config = new CrawlConfig();
         config.setCrawlStorageFolder(crawlStorageFolder);
         config.setMaxDepthOfCrawling(2);
         config.setUserAgentString("Sorry for any inconvenience, I am trying to keep the traffic low per second");
         //config.setPolitenessDelay(20);
         /*
          * Instantiate the controller for this crawl.
          */
         PageFetcher pageFetcher = new PageFetcher(config);
         RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
         RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
         CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

         /*
          * For each crawl, you need to add some seed urls. These are the first
          * URLs that are fetched and then the crawler starts following links
          * which are found in these pages
          */
         controller.addSeed("http://www.marinetraffic.com/en/ais/index/ports/all");

         /*
          * Start the crawl. This is a blocking operation, meaning that your code
          * will reach the line after this only when crawling is finished.
          */
         controller.start(PortExtractor.class, numberOfCrawlers);    

         System.out.println("finished reading");
         System.out.println("Ports: " + PortExtractor.portList.size());
         FileWriter writer = new FileWriter("PortInfo2.txt");

         System.out.println("Writing to file...");
         for(Port p : PortExtractor.portList){
            writer.append(p.print() + "\n");
            writer.flush();
         }
         writer.close();
        System.out.println("File written");
        }
}

The PortExtractor class looks like this:

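import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class PortExtractor extends WebCrawler{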

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"
        );

    public static List<Port> portList = new ArrayList<Port>();

/**
*
* Crawling logic
*/
//@Override
public boolean shouldVisit(WebURL url) {

String href = url.getURL().toLowerCase();
//return  !FILTERS.matcher(href).matches()&&href.startsWith("http://www.worldportsource.com/countries.php") && !href.contains("/shipping/") && !href.contains("/cruising/") && !href.contains("/Today's Port of Call/") && !href.contains("/cruising/") && !href.contains("/portcall/") && !href.contains("/localviews/") && !href.contains("/commerce/")&& !href.contains("/maps/") && !href.contains("/waterways/");
return !FILTERS.matcher(href).matches() && href.startsWith("http://www.marinetraffic.com/en/ais/index/ports/all");
}



/**
* This function is called when a page is fetched and ready 
* to be processed
*/
@Override
public void visit(Page page) {          
String url = page.getWebURL().getURL();
System.out.println("URL: " + url);

   }

}

How do I go about writing the HTML parser? Also, how can I tell the program not to crawl anything other than the port info links? The code above runs, but it breaks every time I try to add the HTML parsing. Any help would be much appreciated.


Solution

  • The first task is to check the site's robots.txt to see whether crawler4j will actually crawl this website. Inspecting the file shows that this will be no problem:

    User-agent: *
    Allow: /
    Disallow: /mob/
    Disallow: /upload/
    Disallow: /users/
    Disallow: /wiki/
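
To illustrate why these rules leave the port pages reachable, here is a simplified sketch of the check. crawler4j's RobotstxtServer performs the real one during the crawl. The class and method names are hypothetical, and plain prefix matching is an assumption that ignores Allow precedence and wildcards.

```java
import java.util.Arrays;
import java.util.List;

final class RobotsCheck {
    // Disallow rules copied from the robots.txt excerpt above.
    private static final List<String> DISALLOWED =
            Arrays.asList("/mob/", "/upload/", "/users/", "/wiki/");

    // Returns true when the URL path does not start with any disallowed prefix.
    static boolean isAllowed(String path) {
        for (String prefix : DISALLOWED) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("/en/ais/index/ports/all"));
        System.out.println(isAllowed("/wiki/Main_Page"));
    }
}
```

None of the port URLs start with a disallowed prefix, so the crawler is free to fetch them.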
    

    Second, we need to figure out which links are of interest for your purpose. This takes some manual investigation. I only checked a few entries on the page mentioned above, but I found that every port link contains the keyword ports, e.g.

    http://www.marinetraffic.com/en/ais/index/ports/all/per_page:50
    http://www.marinetraffic.com/en/ais/details/ports/18853/China_port:YANGZHOU
    http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN
    

    With this information, we are able to modify the shouldVisit method in a whitelisting manner.

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Follow only non-binary links on marinetraffic.com that mention "ports".
        return !FILTERS.matcher(href).matches()
                && href.contains("www.marinetraffic.com")
                && href.contains("ports");
    }
    

    This is a very simple implementation, which could be enhanced by regular expressions.
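
A regex-based variant might look like the sketch below. It is only an illustration: the class name, method names and both patterns are assumptions derived from the three sample URLs above, not from a full survey of the site. A plain "ports" substring check would also accept the map links ending in showports:1, which the patterns below reject.

```java
import java.util.regex.Pattern;

final class PortUrlFilter {
    // Listing pages: /en/ais/index/ports/all, optionally followed by paging segments.
    private static final Pattern PORT_LIST = Pattern.compile(
            "^https?://www\\.marinetraffic\\.com/en/ais/index/ports/all(/.*)?$",
            Pattern.CASE_INSENSITIVE);

    // Detail pages: /en/ais/details/ports/<numeric id>/<country>_port:<name>
    private static final Pattern PORT_DETAILS = Pattern.compile(
            "^https?://www\\.marinetraffic\\.com/en/ais/details/ports/\\d+(/.*)?$",
            Pattern.CASE_INSENSITIVE);

    static boolean isPortListUrl(String url) {
        return PORT_LIST.matcher(url).matches();
    }

    static boolean isPortDetailsUrl(String url) {
        return PORT_DETAILS.matcher(url).matches();
    }

    // Whitelist: follow a link only if it is a listing page or a port detail page.
    static boolean shouldFollow(String url) {
        return isPortListUrl(url) || isPortDetailsUrl(url);
    }

    public static void main(String[] args) {
        System.out.println(shouldFollow(
                "http://www.marinetraffic.com/en/ais/details/ports/793/Korea_port:BUSAN"));
        System.out.println(shouldFollow(
                "http://www.marinetraffic.com/en/ais/home/zoom:14/showports:1"));
    }
}
```

Inside shouldVisit, the two contains checks could then be replaced by a call such as PortUrlFilter.shouldFollow(url.getURL()), while isPortDetailsUrl tells visit which pages actually carry port data.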

    Third, we need to parse the data out of the HTML. The information you are looking for is contained in the following <div> section:

    <div class="bg-info bg-light padding-10 radius-4 text-left">
        <div>
            <span>Latitude / Longitude: </span>
            <b>1.2593655° / 103.75445°</b>
            <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655" title="Show on Map"><img class="loaded" src="/img/icons/show_on_map_magnify.png" data-original="/img/icons/show_on_map_magnify.png" alt="Show on Map" title="Show on Map"></a>
            <a href="/en/ais/home/zoom:14/centerx:103.75445/centery:1.2593655/showports:1" title="Show on Map">Show on Map</a>
        </div>
    
        <div>
            <span>Local Time:</span>
                    <b><time>2016-12-11 19:20</time>&nbsp;[UTC +8]</b>
        </div>
    
                <div>
                <span>Un/locode: </span>
                <b>SGSIN</b>
            </div>
    
                <div>
                <span>Vessels in Port: </span>
                <b><a href="/en/ais/index/ships/range/port_id:290/port_name:SINGAPORE">1021</a></b>
            </div>
    
                <div>
                <span>Expected Arrivals: </span>
                <b><a href="/en/ais/index/eta/all/port:290/portname:SINGAPORE">1059</a></b>
            </div>
    
    </div>
    

    I would use an HTML parser (e.g. Jericho) for this task. With it, you can select exactly this <div> section and read the values you are looking for.
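
As a dependency-free illustration of that last step, the sketch below pulls the coordinates out of the markup with a regular expression instead of an HTML parser. This is a swapped-in technique chosen to keep the example self-contained, and it is more brittle than a real parser. The class name, method name and pattern are assumptions based solely on the <div> snippet shown above, including the assumption that coordinates are printed as signed decimals.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PortCoordinates {
    // Matches: <span>Latitude / Longitude: </span> <b>LAT(deg) / LON(deg)</b>
    private static final Pattern LAT_LON = Pattern.compile(
            "Latitude\\s*/\\s*Longitude:\\s*</span>\\s*<b>\\s*"
            + "(-?\\d+(?:\\.\\d+)?)\\s*\u00B0\\s*/\\s*"
            + "(-?\\d+(?:\\.\\d+)?)\\s*\u00B0");

    // Returns {latitude, longitude}, or empty if the page has no such block.
    static Optional<double[]> extract(String html) {
        Matcher m = LAT_LON.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        double lat = Double.parseDouble(m.group(1));
        double lon = Double.parseDouble(m.group(2));
        return Optional.of(new double[] {lat, lon});
    }

    public static void main(String[] args) {
        // Shortened copy of the sample markup from the answer.
        String html = "<div><span>Latitude / Longitude: </span>\n"
                + "<b>1.2593655\u00B0 / 103.75445\u00B0</b></div>";
        extract(html).ifPresent(c ->
                System.out.println("lat=" + c[0] + ", lon=" + c[1]));
    }
}
```

In the crawler's visit method, the raw HTML should be available via ((HtmlParseData) page.getParseData()).getHtml() once you have checked that the parse data is an HtmlParseData instance. Because several crawler threads run at once, any shared result list should be a synchronized collection.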