r, web-scraping, nlp

Paragraph indentation of a column in a dataframe


I was trying out web scraping and managed to get the headlines and the detailed story content of the news. The code is:

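library(rvest)    # read_html(), html_nodes(), html_text(), html_attr(), %>%
library(stringr)  # str_squish()

webpage <- read_html("https://www.rediff.com/sports")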

headlines.node <- html_nodes(webpage,'.relative h2 a')
headlines <- html_text(headlines.node)
headlines <- str_squish(headlines)

links <- webpage %>% html_nodes(".relative h2 a") %>% 
html_attr("href")

content <- c()
for(i in 1:length(links)){
  newslink <- links[i]
  webpage <- read_html(newslink)
  story.node <- html_nodes(webpage, "p")
  story <-  html_text(story.node)
  story <- str_squish(story)
  content[i] <- paste(story, collapse = '')
}

df <- data.frame("Headlines"=headlines, "Main Content"=content)

However, in order to store the detailed content of each story in a single dataframe field, I had to collapse the paragraphs of the page into one string. Without the collapse, R warned "In content[i] <- story : number of items to replace is not a multiple of replacement length", because html_text() returns one element per paragraph, not a single string.
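
The warning can be reproduced offline with a minimal sketch. The `story` vector below is made-up data standing in for what html_text() returns for one page, and the names `truncated` and `collapsed` are only for illustration:

```r
# Hypothetical stand-in for html_text() output: one element per <p> node.
story <- c("First paragraph.", "Second paragraph.", "Third paragraph.")

# Assigning three strings to a single slot triggers the warning
# "number of items to replace is not a multiple of replacement length";
# only the first string is kept.
truncated <- character(1)
truncated[1] <- story

# Collapsing first produces exactly one string, so the assignment is clean,
# but with collapse = "" the paragraph boundaries are gone.
collapsed <- character(1)
collapsed[1] <- paste(story, collapse = "")
```

So the collapse itself is needed to fit a story into one field; what matters is which separator is used.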

The collapse argument created a column with data like:

df$Main.Content[1] 

The above line returned: [1] NewsApp (Free)Kohli gets 'Spirit of Cricket' gong; Stokes is ICC Cricketer of the Year.India's swashbuckling opener Rohit Sharma was on Wednesday named the ICC's 2019 ODI Cricketer of the Year for his incredible run of form, while English all-rounder Ben Stokes walked away with the overall honours.Indian skipper Virat Kohli was named captain of both the ICC's Test and ODI teams of the year besides winning the 'Spirit of Cricket' award for trying to stop fans from booing Steve Smith during a World Cup match at the Oval. Smith was returning to international cricket from a one-year suspension for ball-tampering at that time.England's World Cup-winning all-rounder Stokes got the biggest prize -- the 'Sir Garfield Sobers Trophy' for Player of the Year, while Australia fast bowler Pat Cummins was named the Test Player of the Year.India seamer Deepak Chahar won the T20 International Performance of the Year, Australia's Marnus Labuschagne was named as Emerging Cricketer of the Year, while Scotland's Kyle Coetzer was declared the Associate Cricketer of the Year.The 32-year-old . . . . . . . . and the remaining story (not copying the complete thing here. . . )

The paragraph breaks are lost and the text looks messy. Is there any way to keep the paragraph breaks of each story and still store it in a single field of a dataframe?

For example, when I run

df$Main.Content[1]

It should return clean text with the paragraphs separated, such as:

NewsApp (Free)Kohli gets 'Spirit of Cricket' gong; Stokes is ICC Cricketer of the Year.India's swashbuckling opener Rohit Sharma was on Wednesday named the ICC's 2019 ODI Cricketer of the Year for his incredible run of form, while English all-rounder Ben Stokes walked away with the overall honours.

Indian skipper Virat Kohli was named captain of both the ICC's Test and ODI teams of the year besides winning the 'Spirit of Cricket' award for trying to stop fans from booing Steve Smith during a World Cup match at the Oval. Smith was returning to international cricket from a one-year suspension for ball-tampering at that time.

(and so on. . . as in the original page)

I have tried to explain my requirement as clearly as I can. Please ask if anything about the question is unclear.


Solution

  • One way would be to collapse the story with newline characters, so that a blank line separates the paragraphs:

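    library(rvest)
    library(stringr)  # provides str_squish()
    
    # `links` and `headlines` are the vectors built in the question
    content <- character(length(links))
    for (i in seq_along(links)) {
      newslink <- links[i]
      webpage <- read_html(newslink)
      story.node <- html_nodes(webpage, "p")
      story <- html_text(story.node)
      story <- str_squish(story)
      # Join paragraphs with a blank line instead of an empty string
      content[i] <- paste(story, collapse = '\n\n')
    }
    
    df <- data.frame(Headlines = headlines, Main_Content = content, stringsAsFactors = FALSE)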
    

    and then view the text with cat(), which prints the newlines as real line breaks (plain printing shows them escaped as \n):

    cat(df$Main_Content[1])
    
    #Diagnosed with a concussion, wicketkeeper Rishabh Pant will not travel 
    #with the Indian team to Rajkot for the second ODI against Australia.
    
    #Pant didn't take the field for the second half of the first ODI in 
    #Mumbai on Tuesday after getting hit on the helmet while batting. He 
    #remains under observation.
    
    #"Rishabh Pant will not be travelling to Rajkot today with other 
    #members. He will join the team later," a BCCI source told PTI.
    
    #"Normally 24 hours is the time to keep someone who has suffered concussion 
    #under observation," he added.
    #....
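
The same idea can be tried without hitting the website. This is only a sketch: the `story` vector and the headline are made-up placeholders, not scraped data, and `paragraphs` is a name introduced here for illustration.

```r
# Hypothetical paragraphs standing in for one scraped story.
story <- c("Para one.", "Para two.", "Para three.")

# One row, with all paragraphs stored in a single character field.
df <- data.frame(Headlines = "Toy headline",
                 Main_Content = paste(story, collapse = "\n\n"),
                 stringsAsFactors = FALSE)

# print() shows the separators escaped as \n\n on one line.
print(df$Main_Content[1])

# cat() writes the string as-is, so the blank lines appear between paragraphs.
cat(df$Main_Content[1], "\n")

# The separator also makes it possible to recover the individual paragraphs.
paragraphs <- strsplit(df$Main_Content[1], "\n\n", fixed = TRUE)[[1]]
```

Because the field is still a single string per row, the dataframe keeps one row per headline; the paragraph structure lives in the separator and only becomes visible when the text is written out with cat() or split again.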