
Keeping the row number in a data frame column


I have a bunch of .txt files (articles) in a folder. I use a for loop to read the text from all of them into R:

input_loc <- "C:/Users/User/Desktop/Folder"
files <- dir(input_loc, full.names = TRUE)
text <- c()
for (f in files) {
  text <- c(text, paste(readLines(f), collapse = "\n"))
}

From here, I tokenize by paragraph and get each paragraph in each article:

library(tokenizers)  # provides tokenize_paragraphs()
paragraphs <- tokenize_paragraphs(text)
sapply(paragraphs, length)
paragraphs

Then I unlist the result and convert it into a data frame:

par_unlisted <- unlist(paragraphs)
par_unlisted
par_unlisted_df <- as.data.frame(par_unlisted)

However, after doing that I lose the per-article paragraph numbering. For example, if the first article has 6 paragraphs, then before unlisting the first paragraph of the second article still has a [1] in front of it, while after unlisting it has a [7]. Once I have the data frame, I would like one column with the paragraph number within its article and another column named "article" with the article number. Thank you in advance.

EDIT: This is roughly what I get once I reach paragraphs:

> paragraphs
[[1]]
[1] "The Miami Dolphins have decided to use their non-exclusive franchise 
tag on wide receiver Jarvis Landry."                                                                                                                                                                                                                                         

[2] "The Dolphins tweeted the announcement Tuesday, the first day teams 
could use their franchise or transition tags. The salary for wide receivers 
getting the franchise tag this offseason is expected to be around $16.2 
million, which will be quite the raise for Landry, who made $894,000 last 
season."    
[[2]]
[1] "Despite months of little-to-no movement on contract negotiations, 
Jarvis Landry has often stated his desire to stay in Miami."                                                                                                                                                                                                                                                                                                  

[2] "The Dolphins used their lone tool to wipe away negotation-driven stress 
-- at least in the immediate future -- and ensure Landry won't be lured away 
from Miami, placing the franchise tag on the receiver on Tuesday, the team 
announced."     

I want to keep the paragraph number ([n]) as a column in the data frame. When I unlist, the paragraphs are no longer separated per article and then per paragraph; I get them in one continuous sequence. In the example I've just posted, I no longer have

[[1]]
[1] ...
[2] ...

[[2]]
[1] ...
[2] ...

but I get

[1] ...
[2] ...
[3] ...
[4] ...                                                                 

Solution

  • Consider iterating through the paragraphs list to build a list of data frames, one per article, each carrying the article and paragraph numbers, and then row-binding all of them into a single data frame.

    Input Data

    paragraphs <- list(
         c("The Miami Dolphins have decided to use their non-exclusive franchise tag on wide receiver Jarvis Landry.",   
            "The Dolphins tweeted the announcement Tuesday, the first day teams could use their franchise or transition tags. The salary for wide receivers 
    getting the franchise tag this offseason is expected to be around $16.2 million, which will be quite the raise for Landry, who made $894,000 last 
    season."),
         c("Despite months of little-to-no movement on contract negotiations, Jarvis Landry has often stated his desire to stay in Miami.",
          "The Dolphins used their lone tool to wipe away negotation-driven stress -- at least in the immediate future -- and ensure Landry won't be lured away 
    from Miami, placing the franchise tag on the receiver on Tuesday, the team announced."))
    

    Dataframe Build

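    df_list <- lapply(seq_along(paragraphs), function(i) {
      # one data frame per article: article index, paragraph index within it, text
      data.frame(article_num   = i,
                 paragraph_num = seq_along(paragraphs[[i]]),
                 paragraph     = paragraphs[[i]],
                 stringsAsFactors = FALSE)
    })
    
    # stack the per-article data frames into a single data frame
    final_df <- do.call(rbind, df_list)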
    

    Output Result

    final_df
    
    #   article_num paragraph_num                                             paragraph
    # 1           1             1 The Miami Dolphins have decided to use their non-e...
    # 2           1             2 The Dolphins tweeted the announcement Tuesday, the...
    # 3           2             1 Despite months of little-to-no movement on contrac...
    # 4           2             2 The Dolphins used their lone tool to wipe away neg...
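
As an alternative sketch, the same two index columns can be derived without building one data frame per article, assuming every element of paragraphs is a character vector. lengths() gives the paragraph count per article, rep() repeats each article index that many times, and sequence() restarts the paragraph counter at 1 for each article. The sample data and the names n_par and final_df2 are illustrative only and do not come from the question.

```r
# Small stand-in for the tokenized articles (illustrative data only)
paragraphs <- list(
  c("First article, paragraph one.", "First article, paragraph two."),
  c("Second article, paragraph one.", "Second article, paragraph two.",
    "Second article, paragraph three.")
)

# Number of paragraphs in each article
n_par <- lengths(paragraphs)

final_df2 <- data.frame(
  # article index repeated once per paragraph of that article
  article_num   = rep(seq_along(paragraphs), times = n_par),
  # paragraph counter that restarts at 1 for every article
  paragraph_num = sequence(n_par),
  # flattened text, in the same order as the two index columns
  paragraph     = unlist(paragraphs),
  stringsAsFactors = FALSE
)

final_df2
```

Because all three columns are built from the same list in the same order, row k of the result always pairs a paragraph with its own article and position. Whether this or the lapply() version reads better is mostly a matter of taste.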