Web scraping IMDb using R

I want to find the links to the top 250 movies on IMDb. I looked through the page's HTML source for a common pattern and found the string "chttp", but I am not sure it will get me anywhere. How can I find a pattern to build the links from?

g = grep(pattern = "chttp", x = imdb_page)  # indices of the lines containing "chttp"
imdb.lines = imdb_page[g]                   # the matching lines themselves

Here's an example output:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"

My main problem is working out the link (URL) for each of the top 250 movies from the code I have already written. I don't know what the next step is. I am also not sure whether "chttp" is a good pattern to pass to grep at all.

According to the results, starting from index 3, the movie titles are at the odd indices:

> imdb.lines[1]
[1] "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>"
> imdb.lines[2]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0111161/?ref_=chttp_tt_1\" ><img src=\",0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[3]
[1] "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>"
> imdb.lines[4]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0068646/?ref_=chttp_tt_2\" ><img src=\",0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[5]
[1] "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
> imdb.lines[6]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0071562/?ref_=chttp_tt_3\" ><img src=\",0,34,50_.jpg\" width=\"34\" height=\"50\" />"
> imdb.lines[7]
[1] "    <a href=\"/title/tt0071562/?ref_=chttp_tt_3\" title=\"Francis Ford Coppola (dir.), Al Pacino, Robert De Niro\" >The Godfather: Part II</a>"
> imdb.lines[9]
[1] "    <a href=\"/title/tt0468569/?ref_=chttp_tt_4\" title=\"Christopher Nolan (dir.), Christian Bale, Heath Ledger\" >The Dark Knight</a>"
> imdb.lines[10]
[1] "  <td class=\"posterColumn\"><a href=\"/title/tt0110912/?ref_=chttp_tt_5\" ><img src=\",0,34,50_.jpg\" width=\"34\" height=\"50\" />"
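
The lines above already contain everything needed to build the links: each title row carries a relative path of the form /title/tt<digits>/. As a rough sketch in base R (not necessarily the most robust approach), the path and the link text can be pulled out with regular expressions. The sample lines are copied from the output above; the names title.lines, paths, base.url and movies are made up for illustration, and the site root in base.url is an assumption.

```r
# Sample lines copied from the grep output above (one chart link,
# one poster row, two title rows).
imdb.lines <- c(
  "      <h3><a href=\"/chart/?ref_=chttp_cht\" >IMDb Charts</a></h3>",
  "  <td class=\"posterColumn\"><a href=\"/title/tt0111161/?ref_=chttp_tt_1\" ><img src=\",0,34,50_.jpg\" width=\"34\" height=\"50\" />",
  "    <a href=\"/title/tt0111161/?ref_=chttp_tt_1\" title=\"Frank Darabont (dir.), Tim Robbins, Morgan Freeman\" >The Shawshank Redemption</a>",
  "    <a href=\"/title/tt0068646/?ref_=chttp_tt_2\" title=\"Francis Ford Coppola (dir.), Marlon Brando, Al Pacino\" >The Godfather</a>"
)

# Keep only the rows whose anchor has both a /title/ href and a title
# attribute; this drops the chart link and the poster rows.
title.lines <- grep("href=\"/title/tt[0-9]+/[^\"]*\" title=", imdb.lines, value = TRUE)

# Relative path of each movie, e.g. "/title/tt0111161/".
paths <- regmatches(title.lines, regexpr("/title/tt[0-9]+/", title.lines))

# Movie name: the text between the anchor's closing '>' and '</a>'.
titles <- sub(".*>([^<]+)</a>.*", "\\1", title.lines)

base.url <- "https://www.imdb.com"  # assumption: site root to prepend
movies <- data.frame(title = titles,
                     url = paste0(base.url, paths),
                     stringsAsFactors = FALSE)
print(movies)
```

Filtering on the title attribute avoids relying on odd or even indices, which would shift if the page ever added or dropped a matching line.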


  • XPath makes jobs like this trivial.

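    library(XML)  # provides htmlParse, xpathSApply, xmlValue and xmlAttrs
    tt <- htmlParse(',desc')
    # First column: link text; remaining columns: each anchor's attributes
    cbind(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlValue),
          t(xpathSApply(tt, "//td[@class='titleColumn']//a", xmlAttrs)))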

    The first argument to cbind returns the titles (the text between the a tags), and the second returns the anchors' attributes: href, which is the relative link to each movie, and title, which in this case lists each film's director and lead actors.
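
To go from the href attribute to a full URL, the relative path still needs a site root in front of it. Here is a minimal sketch of that last step, run on a small inline HTML fragment so it does not depend on the live page. The fragment is hand-written to mimic the titleColumn markup that the XPath expression expects, and base.url, link.xpath and movies are illustrative names rather than anything from the original code.

```r
library(XML)  # htmlParse, xpathSApply, xmlValue, xmlGetAttr

# Hand-written fragment (an assumption about the page layout), using
# hrefs and titles taken from the question's output.
html <- '<table><tr>
  <td class="titleColumn"><a href="/title/tt0111161/?ref_=chttp_tt_1" title="Frank Darabont (dir.), Tim Robbins, Morgan Freeman">The Shawshank Redemption</a></td>
</tr><tr>
  <td class="titleColumn"><a href="/title/tt0068646/?ref_=chttp_tt_2" title="Francis Ford Coppola (dir.), Marlon Brando, Al Pacino">The Godfather</a></td>
</tr></table>'

tt <- htmlParse(html, asText = TRUE)
link.xpath <- "//td[@class='titleColumn']//a"

titles <- xpathSApply(tt, link.xpath, xmlValue)            # link text
hrefs  <- xpathSApply(tt, link.xpath, xmlGetAttr, "href")  # relative links

base.url <- "https://www.imdb.com"  # assumption: site root to prepend
movies <- data.frame(title = titles,
                     # strip the "?ref_=..." tracking suffix, then prepend the root
                     url = paste0(base.url, sub("\\?.*$", "", hrefs)),
                     stringsAsFactors = FALSE)
print(movies)
```

Using xmlGetAttr with "href" returns a plain character vector, which is easier to paste onto a base URL than the attribute matrix produced by xmlAttrs.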