Tags: r, google-chrome, search, google-search

How do I build a search query that gets me data through google search?


I'm working on a project where I need to pull data about specific parks in Florida. My question is how I can program R to run a search query through Google and get each park's area. When I type "area of wekiva springs state park in hectares" into Google search, I get an actual value at the top of the page ("2,833 hectares"). Now I have a list of 52 parks:

structure(list(`unique(df$ParkName)` = structure(c(14L, 47L, 
39L, 12L, 9L, 20L, 5L, 10L, 25L, 28L, 36L, 30L, 31L, 43L, 4L, 
35L, 44L, 48L, 51L, 6L, 21L, 32L, 38L, 42L, 1L, 41L, 27L, 45L, 
46L, 50L, 18L, 37L, 24L, 26L, 13L, 52L, 15L, 2L, 17L, 11L, 22L, 
34L, 49L, 16L, 40L, 7L, 8L, 29L, 33L, 3L, 23L, 19L), .Label = c("Alafia River State Park", 
"Amelia Island State Park", "Big Cypress National Park", "Big Talbot Island State Park", 
"Bill Baggs Cape Florida State Park", "Blue Spring State Park", 
"Caladesi Island State Park", "Cayo Costa State Park", "Collier-Seminole State Park", 
"Curry Hammock State Park", "Dade Battlefield Historic State Park", 
"De Leon Springs State Park", "Delanor-Wiggins Pass State Park", 
"Fakahatchee Strand Preserve State Park", "Faver-Dykes State Park", 
"Fort Cooper State Park", "Fort George Island Cultural State Park", 
"Fort Pierce Inlet State Park/Avalon State Park", "Fort Zachary Taylor Historic State Park", 
"Highlands Hammock State Park", "Hillsborough River State Park", 
"Honeymoon Island State Park", "Hugh Taylor Birch State Park", 
"John D. MacArthur Beach State Park", "John Pennekamp Coral Reef State Park/Key Largo Hammocks", 
"John U. Lloyd Beach State Park", "Jonathan Dickinson State Park", 
"Key Largo Hammocks", "Koreshan State Historic Site", "Lake Griffin State Park", 
"Lake Kissimmee State Park", "Lake Manatee State Park", "Lake Wales Ridge Geopark", 
"Little Manatee River State Park", "Little Talbot Island State Park", 
"Long Key State Park", "Lovers Key State Park", "Myakka River State Park", 
"Ocala National Forest", "Oleta River State Park", "Oscar Scherer State Park", 
"Paynes Creek Historic State Park", "Paynes Prairie Preserve State Park", 
"Pumpkin Hill Creek Preserve State Park", "Savannas Preserve State Park", 
"Seabranch Preserve State Park", "Sebastian Inlet State Park", 
"Talbot Islands State Parks", "Terra Ceia Preserve State Park", 
"Tosohatchee Wildlife Management Area", "Washington Oaks Gardens State Park", 
"Werner-Boyce Salt Springs State Park"), class = "factor")), .Names = "unique(df$ParkName)", row.names = c(NA, 
-52L), class = "data.frame")

I could manually type every single park name into the Google search bar, but I really want to figure out how to build a search query for this so I can apply it to future projects. The problem is that when it comes to building anything this complicated, I'm kind of at a loss. I've only recently started learning what things like "APIs" are.

Any help would be appreciated.


Solution

  • You can scrape the result with the rvest package. How well this works depends a lot on the query, because not every search returns a value at the top of the page.

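    library(rvest)
    
    # Park names to look up; keep them as plain strings rather than factors
    parks <- data.frame(
      name = c("wekiva springs state park", "cayo costa state park"),
      stringsAsFactors = FALSE
    )
    parks$area <- NA_character_
    
    url <- "http://www.google.com"
    
    # Open a session and take the first form on the page (the search form).
    # Note: rvest >= 1.0 renames html_session(), set_values(), submit_form(),
    # and html_nodes() to session(), html_form_set(), session_submit(),
    # and html_elements().
    s <- html_session(url)
    search <- html_form(s)[[1]]
    
    for (i in seq_len(nrow(parks))) {
      # Fill in the search box (form field "q") and submit the form
      query <- paste("area of", parks$name[i], "in hectares")
      a <- set_values(search, q = query)
      session <- submit_form(s, a)
    
      # Text of the element with id "res", where the results appear
      s1 <- html_nodes(session, "#res")
      result <- html_text(s1)
    
      # Keep the text up to the end of the first word, e.g. "2.833 ha";
      # leave NA when the page has no "#res" element
      if (length(result) > 0) {
        parks$area[i] <- gsub("([A-Za-z]+).*", "\\1", result[1])
      }
    }
    
    parks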
    
                        name     area
    1 wekiva springs state park 2.833 ha
    2     cayo costa state park 1.014 ha 
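
Since not every query puts a value at the top of the page, it may help to check that the scraped text really contains a number followed by a unit before storing it. The following is a minimal sketch in base R; `extract_area()` is a hypothetical helper (not part of rvest), and the pattern assumes the unit is written as "hectares" or "ha":

```r
# Hypothetical helper: return the first "<number> <unit>" pair found in the
# scraped text, or NA when the text has no such value.
extract_area <- function(text) {
  if (length(text) == 0 || is.na(text[1])) return(NA_character_)
  # A number (digits with optional "." or "," separators), optional
  # whitespace, then the unit written as "hectares" or "ha"
  pattern <- "[0-9][0-9.,]*\\s*(hectares|ha)\\b"
  m <- regexpr(pattern, text[1], perl = TRUE)
  if (m == -1) return(NA_character_)
  regmatches(text[1], m)
}

extract_area("2,833 hectares (7,000 acres)")
extract_area("no area listed")
```

Inside the loop, `parks$area[i] <- extract_area(result)` could then replace the `gsub()` line, so parks without a direct answer end up as `NA` instead of arbitrary page text.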
    

    To learn a little about rvest, here's a good place to start
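
To reuse the same question for all 52 parks, the query text can be generated from the park names. The sketch below uses only base R; `build_search_url()` is a hypothetical helper, and the `https://www.google.com/search?q=` URL pattern is an assumption about Google's public search address rather than something taken from the answer above:

```r
# Hypothetical helper: turn park names into Google search URLs.
build_search_url <- function(park, unit = "hectares") {
  query <- paste("area of", park, "in", unit)
  # URLencode() with reserved = TRUE escapes spaces, slashes, and other
  # reserved characters so each query is safe to place in a URL
  encoded <- vapply(query, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  paste0("https://www.google.com/search?q=", encoded)
}

# If the names are stored as a factor, convert with as.character() first
park_names <- c("Wekiva Springs State Park", "Cayo Costa State Park")
urls <- build_search_url(park_names)
urls
```

Each URL could then be read with `read_html()` from rvest instead of submitting the search form, though whether Google returns the same answer box for a direct request is not guaranteed.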