Values are not getting entered in dataframe from web scraping


My main aim is to extract the content from a website and save it locally. When the content on the website is updated, the local data should reflect the change.
I am able to read the data from the webpage used in the code. Now I want to save the result into a data frame so that I can export it. Specifically, I want the values of x6 to be entered into the data frame df, so that I can export the result to a text file or an Excel file. Alternatively, please suggest any other way to extract the data from the webpage used in the code (web scraping). My for loop is not working, so I would appreciate any help.

library(rvest)
library(dplyr)
library(qdapRegex) # install.packages("qdapRegex")

google <- read_html("https://bidplus.gem.gov.in/bidresultlists")

(x <- google %>%
  html_nodes(".block") %>%
  html_text())

class(x)

(x1 <- gsub("                                                            ", "", x))
(x2 <- gsub("                                                        ", "", x1))
(x3 <- gsub("            ", "", x2))
(x4 <- gsub("    ", "", x3))
(x5 <- gsub("  ", "", x4))
(x6 <- gsub("\n", "", x5))

class(x6)
length(x6[i])
typeof(x6)

for (i in x6) {
  
  BIDNO <- rm_between(x6[i], "BID NO:", "Status", extract = TRUE)
  Status <- rm_between(x6[i], "Status:", "Quantity Required", extract = TRUE)
  Quantity_Required <- rm_between(x6[i], "Quantity Required:", "Department Name And Address", extract = TRUE)
  Department_Name_And_Address <- rm_between(x6[i], "Department Name And Address:", "Start Date", extract = TRUE)
  Start_Date <- rm_between(x6[i], "Start Date:", "End Date", extract = TRUE)
  # End_Date <- rm_between(x6[i], "End Date: ", "Technical Evaluation", extract=TRUE)

  df <- data.frame("BID_NO", "Status", "Quantity_Required", "Department_Name_Address", "Start_Date")
}

df

View(df)

Solution

  • Targeting the desired elements with XPath is likely a path with less frustration & error:

    library(rvest)
    library(dplyr)
    
    pg <- read_html("https://bidplus.gem.gov.in/bidresultlists")
    

    Get all the bid blocks:

    blocks <- html_nodes(pg, ".block")
    

    Target items & quantity div:

    items_and_quantity <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Item(s)')]")
    

    Pull out items and quantities:

    items <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Item(s)')]/following-sibling::span") %>% html_text(trim=TRUE)
    quantity <- html_nodes(items_and_quantity, xpath=".//strong[contains(., 'Quantity')]/following-sibling::span") %>% html_text(trim=TRUE) %>% as.numeric()
    

    Get the department name and address, and modify the text so that its three lines are separated with pipes (|), which makes it possible to split them apart later. The pipe symbol is a pain in a regex since it has to be escaped, but it is highly unlikely to appear in the text, whereas tabs can often cause confusion later on.

    department_name_and_address <- html_nodes(blocks, xpath=".//div[@class='col-block' and contains(., 'Department Name And Address')]") %>% 
      html_text(trim=TRUE) %>% 
      gsub("\n", "|", .) %>% 
      gsub("[[:space:]]*\\||\\|[[:space:]]*", "|", .)
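
As a rough sketch of the later separation mentioned above, the pipe-delimited field can be split with base R. The first two sample values are copied from the str() output shown further down in this answer; the third is a hypothetical three-part value made up for illustration, since the real page may lay the field out differently.

```r
# First two values come from the str() output in this answer;
# the third is a hypothetical three-part value for illustration only
dept <- c(
  "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a",
  "Department Name And Address:||Maharashtra Energy Department Maharashtra Bhusawal Tps N/a",
  "Department Name And Address:|Ministry X|Department Y|Office Z"
)

# Drop the leading label and any pipes that follow it. The pipe is escaped
# because an unescaped "|" means alternation in a regex.
dept_clean <- sub("^Department Name And Address:\\|*", "", dept)

# Split on a literal pipe; fixed = TRUE sidesteps regex escaping entirely
dept_parts <- strsplit(dept_clean, "|", fixed = TRUE)

print(dept_parts)
```

Using fixed = TRUE is one way around the escaping problem; an escaped pattern such as "\\|" would do the same job.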
    

    Target the block header which has bid # and status:

    block_header <- html_nodes(blocks, "div.block_header")
    

    Pull out bid # (see note at the end of the answer):

    html_nodes(block_header, xpath=".//p[contains(@class, 'bid_no')]") %>%
      html_text(trim=TRUE) %>% 
      gsub("^.*: ", "", .) -> bid_no
    

    Pull out status:

    html_nodes(block_header, xpath=".//p/b[contains(., 'Status')]/following-sibling::span") %>% 
      html_text(trim=TRUE) -> status
    

    Target & pull out start & end dates:

    html_nodes(blocks, xpath=".//strong[contains(., 'Start Date')]/following-sibling::span") %>%
      html_text(trim=TRUE) -> start_date
    
    html_nodes(blocks, xpath=".//strong[contains(., 'End Date')]/following-sibling::span") %>%
      html_text(trim=TRUE) -> end_date
    

    Make a data frame:

    data.frame(
      bid_no,
      status,
      start_date,
      end_date,
      items,
      quantity,
      department_name_and_address,
      stringsAsFactors=FALSE
    ) -> xdf
    

    Some of the bids are "RA"s so we can also create a column letting us know which ones are which:

    xdf$is_ra <- grepl("/RA/", bid_no)
    

    The resultant data frame:

    str(xdf)
    ## 'data.frame': 10 obs. of  8 variables:
    ##  $ bid_no                     : chr  "GEM/2018/B/93066" "GEM/2018/B/93082" "GEM/2018/B/93105" "GEM/2018/B/93999" ...
    ##  $ status                     : chr  "Not Evaluated" "Not Evaluated" "Not Evaluated" "Not Evaluated" ...
    ##  $ start_date                 : chr  "25-09-2018 03:53:pm" "27-09-2018 09:16:am" "25-09-2018 05:08:pm" "26-09-2018 05:21:pm" ...
    ##  $ end_date                   : chr  "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" "18-10-2018 03:00:pm" ...
    ##  $ items                      : chr  "automotive chassis fitted with engine" "automotive chassis fitted with engine" "automotive chassis fitted with engine" "Storage System" ...
    ##  $ quantity                   : num  1 1 1 2 90 1 981 6 4 376
    ##  $ department_name_and_address: chr  "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Ministry Of Steel Na Kirandul Complex N/a" "Department Name And Address:||Maharashtra Energy Department Maharashtra Bhusawal Tps N/a" ...
    ##  $ is_ra                      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
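
Since the question asks about exporting the result, here is a minimal sketch of writing a data frame to a CSV file, which Excel can open. It uses a cut-down, two-row stand-in for xdf (values taken from the output above) and writes to a temporary directory; the file name bids.csv is an arbitrary choice.

```r
# Cut-down stand-in for xdf, using values from the str() output above
xdf <- data.frame(
  bid_no   = c("GEM/2018/B/93066", "GEM/2018/B/93082"),
  status   = c("Not Evaluated", "Not Evaluated"),
  quantity = c(1, 1),
  stringsAsFactors = FALSE
)

# Write to a temporary location; "bids.csv" is an arbitrary file name
out_file <- file.path(tempdir(), "bids.csv")
write.csv(xdf, out_file, row.names = FALSE)

# Read the file back to confirm that the export round-trips
check <- read.csv(out_file, stringsAsFactors = FALSE)
print(check)
```

In real use, replace out_file with a path of your choosing.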
    

    I'll let you turn dates into POSIXct elements.
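
One possible way to do that conversion is sketched below, using two start_date values from the output above. Two assumptions are baked in: the times are taken to be Indian Standard Time (the page does not say), and the session locale is assumed to recognise AM/PM for the %p specifier, which is locale-dependent.

```r
# Sample values copied from the start_date column shown above
start_date <- c("25-09-2018 03:53:pm", "27-09-2018 09:16:am")

# Rewrite the trailing ":pm" / ":am" as " PM" / " AM" so that %p sees a
# conventional AM/PM marker
normalized <- toupper(sub(":([ap]m)$", " \\1", start_date))

# tz = "Asia/Kolkata" is an assumption about the site's time zone
start_dt <- as.POSIXct(normalized, format = "%d-%m-%Y %I:%M %p", tz = "Asia/Kolkata")

print(start_dt)
```

The same two steps should work for end_date, since it uses the same layout.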

    The contiguous code w/o explanation is here.

    Also, this isn't Java: for loops are rarely the solution to a problem in R. You should also read up on regexes, since counting spaces for substitution is a path fraught with peril and frustration.
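
To illustrate both points, here is a small base R sketch that collapses whitespace with a single character-class regex and extracts fields from every element at once, with no loop. The sample strings are hypothetical: they only imitate the labels used in the question's rm_between() calls, and the real page text may differ.

```r
# Hypothetical scraped text, modelled on the labels used in the question
x <- c(
  "BID NO:   GEM/2018/B/93066 \n      Status:   Not Evaluated\n    Quantity Required:  1",
  "BID NO:GEM/2018/B/93082\n\n  Status:    Not Evaluated   \n Quantity Required: 1"
)

# Collapse every run of whitespace (spaces, tabs, newlines) to one space,
# rather than counting spaces in a chain of gsub() calls
x_clean <- trimws(gsub("[[:space:]]+", " ", x))

# sub() is vectorised, so each call handles the whole vector at once;
# the lazy group (.*?) captures the text between two labels
bid_no <- sub("^.*BID NO: *(.*?) *Status:.*$", "\\1", x_clean, perl = TRUE)
status <- sub("^.*Status: *(.*?) *Quantity Required:.*$", "\\1", x_clean, perl = TRUE)

# Build the data frame once, from complete columns
df <- data.frame(bid_no, status, stringsAsFactors = FALSE)
print(df)
```

The XPath approach above is still likely to be more robust, because it does not depend on the order of labels in the flattened text.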