I'm struggling to scrape a website with rvest. I'd like to get the text of each section as a separate element.
Here is the page I'm practicing on: https://www.edx.org/professional-certificate/harvardx-computer-science-for-artifical-intelligence
CS50's Introduction to Computer Science 6–18 hours per week, for 12 weeks An introduction to the intellectual enterprises of computer science and the art of programming.
CS50's Introduction to Artificial Intelligence with Python 10–30 hours per week, for 7 weeks Learn to use machine learning in Python in this introductory course on artificial intelligence. Job Outlooks
....
Here is my code:
html_nodes(page, xpath='//*[@id="main-content"]/div[3]/div/div[3]/div/div/div/ol') %>% html_text()
[1] "HarvardX's Computer Science for Artificial Intelligence Professional Certificate CS50's Introduction to Computer Science6–18 hours per week, for 12 weeksAn introduction to the intellectual enterprises of computer science and the art of programming.View the course CS50's Introduction to Artificial Intelligence with Python10–30 hours per week, for 7 weeksLearn to use machine learning in Python in this introductory course on artificial intelligence.View the courseJob OutlookEmployment of software developers is projected to grow 24% from 2016 to 2026, much faster than the average for all occupations. (source: Occupational Outlook Handbook)The median pay for software developers in the U.S. in 2018 was $105,590 per year. (source: Occupational Outlook Handbook)"
When I try the XPath '//*[@id="main-content"]/div[3]/div/div[3]/div/div/div/ol/li[2]', I only get one of them:
html_nodes(page, xpath='//*[@id="main-content"]/div[3]/div/div[3]/div/div/div/ol/li[2]') %>% html_text()
[1] " CS50's Introduction to Computer Science6–18 hours per week, for 12 weeksAn introduction to the intellectual enterprises of computer science and the art of programming.View the course"
Do you know how I can do this without having to specify the three XPaths independently? I can't find a way.
I think you're after the text inside the 3 collapsible sections. If so:
library(rvest)
library(tidyverse)
url <- "https://www.edx.org/professional-certificate/harvardx-computer-science-for-artifical-intelligence"
page <- read_html(url)
page %>%
  html_nodes("div.path-details") %>%
  .[2:4] %>%
  html_text()
# [1] "CS50's Introduction to Computer Science6–18 hours per week, for 12 weeksAn introduction to the intellectual enterprises of computer science and the art of programming.View the course"
# [2] "CS50's Introduction to Artificial Intelligence with Python10–30 hours per week, for 7 weeksLearn to use machine learning in Python in this introductory course on artificial intelligence.View the course"
# [3] "Job OutlookEmployment of software developers is projected to grow 24% from 2016 to 2026, much faster than the average for all occupations. (source: Occupational Outlook Handbook)The median pay for software developers in the U.S. in 2018 was $105,590 per year. (source: Occupational Outlook Handbook)"
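As a possible alternative that stays closer to the XPath in the question: an XPath step without a positional index such as `[2]` matches every sibling of that name, so ending the expression in `/ol/li` should return one element per list item. The sketch below shows the idea on a small hypothetical HTML fragment (the markup, the shortened XPath, and the placeholder texts are assumptions for illustration, not the real page structure), so it runs without network access.

```r
library(rvest)

# Hypothetical stand-in for the page: an <ol> whose first <li> is the
# program title and whose remaining <li> items are the three sections.
doc <- read_html('
<div id="main-content">
  <ol>
    <li>Program title</li>
    <li><div class="path-details">Course A</div></li>
    <li><div class="path-details">Course B</div></li>
    <li><div class="path-details">Job Outlook</div></li>
  </ol>
</div>')

# No [2] after li, so every <li> under the <ol> is matched.
items <- doc %>%
  html_nodes(xpath = '//*[@id="main-content"]/ol/li') %>%
  html_text(trim = TRUE)

# Drop the first item (the title) to keep only the three sections.
sections <- items[-1]
print(sections)
```

On the live page, the equivalent would presumably be the XPath from the question with the trailing `[2]` removed. I have not checked that against the current page, and long positional XPaths like that one tend to break when the layout changes, which is one reason the class-based selector above may be the more robust choice.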