Tags: r, web-scraping, html-table

How do I webscrape a table from neurosynth with R?


I am trying to web-scrape some table data related to fMRI from Neurosynth: https://www.neurosynth.org/locations/2_2_2_6/ (the specific data doesn't matter for now; I just want to be able to get the data from the table in the Associations section of a locations page).

I have managed to webscrape a simple wikipedia page using the following code:

    library(rvest)

    url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    read_html(url) %>%
      html_element("table") %>%
      html_table()


This worked absolutely fine, no problem. I try the same thing with my Neurosynth page, i.e.:

    neurosynth_link <- "https://www.neurosynth.org/locations/2_2_2_6/"
    read_html(neurosynth_link) %>%
      html_element("table") %>%
      html_table()

I get:

    # A tibble: 0 × 4
    # … with 4 variables: Title <lgl>, Authors <lgl>, Journal <lgl>, Activations <lgl>

Doesn't work.

I have played around a bit and have managed to get the headings of the table that I want (z-score, posterior prob., etc.) with the following code:

    neurosynth_link <- "https://www.neurosynth.org/locations/2_2_2_6/"
    neurosynth_page <- read_html(neurosynth_link)
    neuro_synth_table <- neurosynth_page %>%
      html_nodes("table#location_analyses_table") %>%
      html_table()
    neuro_synth_table

    [[1]]
    # A tibble: 1 × 5
      ``    `Individual voxel` `Individual voxel` `Seed-based network` `Seed-based network`
      <chr> <chr>              <chr>              <chr>                <chr>
    1 Name  z-score            Posterior prob.    Func. conn. (r)      Meta-analytic coact. (r)

But that's as far as I can get. What's going on?


Solution

  • The table you want is generated by JavaScript, so it doesn't actually exist within the static HTML you are trying to scrape. The JavaScript downloads a separate JSON file that contains all the data for every page of the table.
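
    As a quick check (a minimal sketch with rvest, assuming the same table id as in your second attempt), you can see that the static HTML contains the table's header but no body rows, which is why html_table() comes back empty:

    library(rvest)

    page <- read_html("https://www.neurosynth.org/locations/2_2_2_6/")
    # The header is part of the static HTML (this is what you managed to scrape)
    length(html_elements(page, "table#location_analyses_table thead tr"))
    # The data rows are filled in later by JavaScript, so this should return 0
    length(html_elements(page, "table#location_analyses_table tbody tr"))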

    This is actually good news - it means you can get the entries for all 134 pages of the table you are trying to scrape all at once. We can find the JSON file's URL in the Network tab of the browser's developer tools and request it directly. With a little bit of wrangling we get all the data in a single data frame. Here's a full reprex:

    library(httr)

    # JSON endpoint found in the browser's developer tools (Network tab)
    url    <- "https://www.neurosynth.org/api/locations/2_2_2_6/compare?_=1645644227258"
    # Parse the JSON response; the table rows live in its "data" element
    result <- content(GET(url), "parsed")$data
    names  <- c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
    # Each list entry is one table row; bind them into a single data frame
    df     <- do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
    df$z_score <- as.numeric(df$z_score)
    #> Warning: NAs introduced by coercion
    df <- df[order(-df$z_score), ]
    

    Now we have the data in a nice data frame:

    head(df)
    #>         Name z_score post_prob func_con meta_analytic
    #> 760       mm    8.78      0.86     0.15          0.52
    #> 509    gamma    8.10      0.85     0.19          0.63
    #> 1135 sources    6.46      0.77     0.10          0.32
    #> 825    noise    5.33      0.73     0.00          0.08
    #> 671  lesions    4.66      0.72    -0.01          0.00
    #> 1137 spatial    4.57      0.63    -0.15          0.00
    

    And we have all the data:

    nrow(df)
    #> [1] 1334
    

    Created on 2022-02-23 by the reprex package (v2.0.1)
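
    If you prefer, the same wrangling can be done with jsonlite (a minimal alternative sketch, assuming the endpoint returns the rows as a plain JSON array of arrays, as the httr code above suggests):

    library(jsonlite)

    url <- "https://www.neurosynth.org/api/locations/2_2_2_6/compare?_=1645644227258"
    # fromJSON() simplifies the array of rows into a character matrix
    raw <- fromJSON(url)$data
    df2 <- as.data.frame(raw, stringsAsFactors = FALSE)
    names(df2) <- c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
    # Convert the numeric columns and sort by z-score, as in the httr version
    df2[-1] <- lapply(df2[-1], as.numeric)
    df2 <- df2[order(-df2$z_score), ]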