Tags: javascript, html, web-crawler, html-parsing

Extracting all links in a webpage


I want to extract the list of all government websites of India for my survey.

The list can be found here: http://goidirectory.nic.in/index.php

The problem is that the entries in the list are not plain links. Clicking an entry opens a new tab, which then redirects to the requested website.

So google klipper and other tools that extract links from a web page don't work here.

I don't know anything about JavaScript.

One thing I noticed is that when I hover the mouse pointer over an entry, it shows the address of the website, as shown below:

[Screenshot: mouse pointer hovering over an entry]

For example, http://presidentofindia.gov.in appears in the highlight.

I need a list of these website links.

Thanks


Solution

  • Hi, kindly check https://jsfiddle.net/9b0wL9tn/

    jQuery

    $(document).ready(function(){
        $('a').each(function(){
            console.log($(this).attr('href'));
        });
    });
    

    NOTE: Open the website in Chrome, right-click the page, choose Inspect, go to the Console tab, then paste the code below and press Enter.

    Run this code in the console first:

    var jq = document.createElement('script');
    jq.src = "https://ajax.googleapis.com/ajax/libs/jquery/2.1.4/jquery.min.js";
    document.getElementsByTagName('head')[0].appendChild(jq);
    // ... wait a moment for the script to load, then run:
    jQuery.noConflict();
    

    Then run this (jQuery.noConflict() above releases $, so call jQuery directly):

    jQuery('a').each(function(){
        console.log(jQuery(this).attr('href'));
    });
    

    This lists every link on the page; just copy the output from the console.

    UPDATE

    I have updated the script. After following the previous steps, run the following script in the console:

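    // Collect the title attribute of every link ("undefined" if it has none).
    var arr = [];
    jQuery('a').each(function(i){
        arr[i] = jQuery(this).attr('title') + "";
    });

    // For titles containing a URL, print the part before the first '-'.
    jQuery.each(arr, function(i){
        if (arr[i].indexOf('http') > -1)
            console.log(arr[i].substr(0, arr[i].indexOf('-')));
    });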
    

    Here is the screenshot: http://www.imageno.com/lj7tuyr9pt2opic.html
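
The title-based script above can also be written without loading jQuery, using only the browser's built-in DOM API. What follows is a minimal sketch under stated assumptions, not the answerer's code: `extractUrlsFromTitles` is a hypothetical helper name, and it assumes each link's title holds the URL followed by whitespace (the exact title format on the page has not been verified). It matches the URL with a regular expression instead of cutting at the first '-'.

```javascript
// Hypothetical helper: pulls the first http(s) URL out of each title string.
// Assumption: the URL inside a title contains no whitespace and is followed
// by whitespace or the end of the string.
function extractUrlsFromTitles(titles) {
  const urls = [];
  titles.forEach(function (title) {
    // Missing titles (null/undefined) are treated as empty strings.
    const match = /https?:\/\/\S+/.exec(title || '');
    if (match) {
      urls.push(match[0]);
    }
  });
  return urls;
}

// Browser console only: gather the titles of all links that have one.
// Skipped automatically when no DOM is available.
if (typeof document !== 'undefined') {
  const titles = Array.from(
    document.querySelectorAll('a[title]'),
    function (anchor) { return anchor.getAttribute('title'); }
  );
  console.log(extractUrlsFromTitles(titles).join('\n'));
}
```

Pasted into the Console tab, this prints one URL per line, ready to copy. Unlike the substr-based version, a hostname that itself contains a hyphen is kept whole, provided the whitespace assumption holds.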