Tags: r, blogs, social-networking, social-graph

Mapping the link network between blogs using R?


I would like any advice on how to create and visualize a link map between blogs so as to reflect the "social network" between them.

Here is how I am thinking of doing it:

  1. Start with one (or more) blog home page and collect all the links on that page
  2. Remove all the links that are internal links (that is, if I start from www.website.com, then I want to remove all the links of the form "www.website.com/***"), but store all the external links.
  3. Go to each of these links (assuming you haven't visited them already), and repeat step 1.
  4. Continue until (let's say) X jumps from the first page.
  5. Plot the data collected.

I imagine that in order to do this in R, one would use RCurl/XML (Thanks Shane for your answer here), combined with something like igraph.
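
For the final plotting step, I imagine something along these lines might work once the links have been collected into an edge list (just a rough, untested guess based on skimming the igraph documentation; the edges data frame below is made-up illustration data):

    library(igraph)
    
    # made-up example edge list: one row per link from one blog to another
    edges <- data.frame(
      from = c("www.r-bloggers.com", "www.r-bloggers.com", "blog-a.example.com"),
      to   = c("blog-a.example.com", "blog-b.example.com", "blog-b.example.com"),
      stringsAsFactors = FALSE
    )
    
    # build a directed graph from the edge list and plot it
    g <- graph_from_data_frame(edges, directed = TRUE)
    plot(g, vertex.size = 5, edge.arrow.size = 0.3, vertex.label.cex = 0.8)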

But since I don't have real experience with any of these packages, is there someone here who might be willing to correct me if I have missed an important step, or to share a useful snippet of code for this task?

p.s: My motivation for this question is that in a week I am giving a talk at useR 2010 on "blogging and R", and I thought this might be a nice way to both give the audience something fun and motivate them to try something like this themselves.

Thanks a lot!

Tal


Solution

  • NB: This example is a very BASIC way of getting the links and therefore would need to be tweaked in order to be more robust. :)

    I don't know how useful this code is, but hopefully it might give you an idea of the direction to go in (just copy and paste it into R; it's a self-contained example once you've installed the packages RCurl and XML):

    library(RCurl)
    library(XML)
    
    # fetch a page and pull the href attribute out of every anchor tag
    get.links.on.page <- function(u) {
      doc <- getURL(u)
      html <- htmlTreeParse(doc, useInternalNodes = TRUE)
      nodes <- getNodeSet(html, "//html//body//a[@href]")
      urls <- sapply(nodes, xmlGetAttr, "href")
      urls <- sort(urls)
      return(urls)
    }
    
    # a naive way of doing it; Python has 'urlparse', which is supposed to be rather good at this
    get.root.domain <- function(u) {
      root <- unlist(strsplit(u, "/"))[3]
      return(root)
    }
    
    # a naive method to filter out duplicate, invalid and self-referencing urls
    filter.links <- function(seed, urls) {
      urls <- unique(urls)
      # keep only absolute http(s) links
      urls <- urls[grepl("^http", urls)]
      # drop links that point back to the seed's own domain
      seed.root <- get.root.domain(seed)
      urls <- urls[!grepl(seed.root, urls, fixed = TRUE)]
      return(urls)
    }
    
    # pass each url to this function
    main.fn <- function(seed) {
      raw.urls <- get.links.on.page(seed)
      filtered.urls <- filter.links(seed, raw.urls)
      return(filtered.urls)
    }
    
    ### example  ###
    seed <- "http://www.r-bloggers.com/blogs-list/"
    urls <- main.fn(seed)
    
    # crawl first 3 links and get urls for each, put in a list 
    x <- lapply(as.list(urls[1:3]), main.fn)
    names(x) <- urls[1:3]
    x
    

    If you copy and paste it into R, and then look at x, I think it'll make sense.
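
    If you wanted to push it further towards your steps 3-5, something roughly like the function below could crawl X jumps out from the seed and collect the links as an edge list ready for igraph. It's an untested sketch building on main.fn above (crawl.blogs and its depth argument are just names I've made up), so treat it as a starting point rather than working code:

    # rough sketch: breadth-first crawl up to 'depth' jumps from the seed,
    # collecting (from, to) link pairs as we go
    crawl.blogs <- function(seed, depth = 2) {
      visited <- character(0)
      frontier <- seed
      edges <- data.frame(from = character(0), to = character(0),
                          stringsAsFactors = FALSE)
      for (i in seq_len(depth)) {
        # only visit pages we haven't seen before
        frontier <- setdiff(frontier, visited)
        if (length(frontier) == 0) break
        new.frontier <- character(0)
        for (u in frontier) {
          # skip pages that fail to download or parse
          links <- tryCatch(main.fn(u), error = function(e) character(0))
          if (length(links) > 0) {
            edges <- rbind(edges, data.frame(from = u, to = links,
                                             stringsAsFactors = FALSE))
          }
          new.frontier <- c(new.frontier, links)
          visited <- c(visited, u)
        }
        frontier <- unique(new.frontier)
      }
      return(edges)
    }
    
    # e.g. edges <- crawl.blogs(seed, depth = 2)
    #      library(igraph)
    #      plot(graph_from_data_frame(edges, directed = TRUE))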

    Either way, good luck mate!

    Tony Breyal