
Web Scrape your own Stack Overflow profile using R


I am currently experimenting with web scraping my own Stack Overflow profile (while logged out) using rvest. To find the CSS selectors I use the SelectorGadget extension for Google Chrome. To start, I would like to extract the numbers and their headers under the Stats header of my profile, which are marked in green and yellow in the picture below (the colors come from the extension):

[image: screenshot of the profile's Stats section with the selected elements highlighted in green and yellow]

This gives me the following CSS selectors: .md\:fl-auto and .fc-dark. The .fc-dark selector matches the numbers and .md\:fl-auto matches the headers (reputation, reached, etc.). Extracting the numbers works, but when extracting the headers I get the following error: Error: '\:' is an unrecognized escape in character string starting "".md\:". Is it possible to use this CSS selector and save both outputs in a data frame? Here is a reproducible example:

library(rvest)
library(dplyr)

link <- "https://stackoverflow.com/users/14282714/quinten"
profile <- read_html(link)

numbers <- profile %>% html_nodes(".fc-dark") %>% html_text()
numbers
[1] "12,688" "49k"    "847"    "9"     
headers <- profile %>% html_nodes(".md\:fl-auto") %>% html_text()
Error: '\:' is an unrecognized escape in character string starting "".md\:"
 

I am open to better options for web scraping my Stack Overflow profile!


Solution
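
The error in the question comes from R's string parser rather than from rvest: inside a quoted R string a backslash starts an escape sequence, and \: is not one that R recognizes. A minimal sketch of two ways to write the selector, assuming R 4.0.0 or later for the raw-string form (the variable names are illustrative only):

```r
# The CSS selector that rvest should receive is: .md\:fl-auto
escaped <- ".md\\:fl-auto"      # backslash doubled inside a normal string
raw     <- r"(.md\:fl-auto)"    # raw string literal (R >= 4.0.0), no doubling

identical(escaped, raw)   # TRUE: both hold the same 12 characters
nchar(escaped)            # 12
cat(escaped, "\n")        # shows the string as the CSS engine sees it
```

Either form can be passed to html_nodes() as the selector.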

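  • library(rvest)
    library(dplyr)
    library(stringr)
    # `profile` is the page read with read_html() in the question.
    # The backslash is doubled so that R passes the selector .md\:fl-auto on to rvest.
    profile %>% html_nodes(".md\\:fl-auto") %>% html_text() %>%
      stringr::str_squish() %>%      # collapse whitespace, e.g. "12,688 reputation"
      as_tibble() %>%                # one column named `value`
      tidyr::separate(value, into = c("number", "header"), sep = "\\s") %>%
      # drop the thousands separator, expand a trailing "k" (assumes whole
      # numbers such as 49k), then convert the result to numeric
      mutate(number = stringr::str_remove(number, ",") %>%
               sub("k", "000", ., fixed = TRUE) %>%
               as.numeric())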
    

    Output:

    # A tibble: 4 x 2
      number header    
       <dbl> <chr>     
    1  12688 reputation
    2  49000 reached   
    3    847 answers   
    4     10 questions
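
    Replacing "k" with "000" works for whole-number abbreviations such as 49k, but it would turn a value like 1.5k into 1.5000. A hedged alternative, assuming counts are abbreviated only with a trailing lowercase k or m (parse_count is a hypothetical helper, not part of rvest or stringr):

```r
# Hypothetical helper: convert strings such as "12,688" or "49k" to numbers.
# Assumption: the only suffixes are a trailing "k" (thousand) or "m" (million).
parse_count <- function(x) {
  x <- gsub(",", "", trimws(x), fixed = TRUE)        # drop thousands separators
  multiplier <- ifelse(grepl("k$", x), 1e3,
                ifelse(grepl("m$", x), 1e6, 1))      # pick the scale per element
  as.numeric(sub("[km]$", "", x)) * multiplier       # strip suffix, then scale
}

parse_count(c("12,688", "49k", "847", "1.5k"))
# 12688 49000 847 1500
```

    In the pipeline above, mutate(number = parse_count(number)) could then stand in for the str_remove() and sub() steps.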