
Create a data.frame from multiple text files with row names as columns in R


I am a newbie in R and I am already failing at reading in my files.

I have a list of 1100 .txt files. The first 4 rows of each file are the metadata ("Newspaper", "Date", "Ressort", "Title"). The text begins in the fifth row.

PROBLEM: I can't get the data.frame built correctly. Every row just repeats the content of a single .txt file.

So, this is what I tried.

I read the files into R with list.files() and a for loop:

datalist <- list.files()

for(i in datalist){
  test <- readLines(i, encoding = 'UTF-8')
}

The first file is the test file:

test <- readLines(i, encoding = 'UTF-8')

The test-file gives me the metadata

meta <- test[1:4]

Then I define everything from the 5th row onward as the text and remove the line breaks:

text <- paste(test[5:length(test)], collapse = '')

Then I create my data.frame with the meta as columns and the text

df <- data.frame(datalist, Newspaper = meta[1], Date = meta[2], Ressort = meta[3], Text = text)
df

Writing it as a CSV works fine:

write.csv(df, "test.csv")

The problem now is that my columns are set up correctly, but every row contains the same data, namely the data from test in the for loop. Any ideas? I would be very grateful for any tips or answers. Cheers, y'all!


Solution

  • A possible solution uses {purrr}'s map_dfr to map a list of file names to a (custom) function that reads the data. The main advantages of this approach are that you don't have to create a list of data frames and merge them afterwards, and that by avoiding the loop you don't create temporary objects that clutter your working environment. All objects created within the function live only within the function.

    The disadvantage is that it might be harder at first to understand what goes on behind the scenes, whereas in a for loop every step is more explicit. If you have the time, I encourage you to watch Hadley Wickham's video The Joy of Functional Programming (for Data Science). From around the 8-minute mark he talks about exactly the kind of problem you are facing. But the whole video is worth your time! :)

    library(tidyverse)
    datalist <- list.files("data/newspaper", full.names = TRUE)
    
    custom_read_lines <- function(file) {
      # define function to read files and return a data frame already as expected
      # in final output
      whole_file <- readLines(file)
      text <- paste(whole_file[5:length(whole_file)], collapse = '')
    
      df <- data.frame(
        Newspaper  = whole_file[1],
        Date       = whole_file[2],
        Ressort    = whole_file[3],
        Title      = whole_file[4],
        Text       = text
      )
    
      return(df)
    }
    
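    ## use purrr's map_dfr to apply the custom function to each entry of datalist
    ## and row-bind the results (map_dfr is superseded since purrr 1.0.0;
    ## map(custom_read_lines) %>% list_rbind() is the current equivalent)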
    df_merged <- datalist %>% map_dfr(custom_read_lines)
    
    df_merged %>% as_tibble() #just for nicer output
    # A tibble: 3 x 5
    # Newspaper   Date      Ressort Title             Text
    # <chr>       <chr>     <chr>   <chr>             <chr>
    # 1 Nice News   2021-01-… ressort Where does it co… "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of …
    # 2 Boring News 2020-11-… ressort Why do we use it? "It is a long established fact that a reader will be distracted by the readable content of a pa…
    # 3 Old News    1990-01-… ressort What is Lorem Ip… "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has bee…
    
    
    
    # if your data looks a bit different, try the function on a single file first;
    # once that works, pass it to map_dfr
    
    single_file <- datalist[2]
    custom_read_lines(single_file) %>% as_tibble()
    # A tibble: 1 x 5
    # Newspaper   Date      Ressort Title         Text
    # <chr>       <chr>     <chr>   <chr>         <chr>
    # 1 Boring News 2020-11-… ressort Why do we us… It is a long established fact that a reader will be distracted by the readable content of a page wh…
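
For comparison, here is a minimal base R sketch of the same idea with an explicit for loop. In the question's loop, `test` is overwritten on every pass, so only the last file read survives, and `data.frame(datalist, ...)` then recycles those single values across all file names. The sketch stores one single-row data frame per file and binds them at the end. The folder, file names, and file contents are made up so the code runs on its own. Joining the text lines with a space (instead of `''`) is my own choice.

```r
# Two made-up files in a temporary folder, so the sketch runs on its own.
dir_demo <- file.path(tempdir(), "newspaper_loop_demo")
dir.create(dir_demo, showWarnings = FALSE)
writeLines(
  c("Paper A", "2021-01-01", "politics", "Title A", "First line.", "Second line."),
  file.path(dir_demo, "a.txt")
)
writeLines(
  c("Paper B", "2020-11-05", "sports", "Title B", "Only line."),
  file.path(dir_demo, "b.txt")
)

files <- list.files(dir_demo, pattern = "\\.txt$", full.names = TRUE)

rows <- vector("list", length(files))  # one slot per file
for (i in seq_along(files)) {
  lines <- readLines(files[i], encoding = "UTF-8")
  # store the result in its own slot instead of overwriting one object
  rows[[i]] <- data.frame(
    Newspaper = lines[1],
    Date      = lines[2],
    Ressort   = lines[3],
    Title     = lines[4],
    # lines[-(1:4)] drops the metadata; it is empty if the file has no text
    Text      = paste(lines[-(1:4)], collapse = " "),
    stringsAsFactors = FALSE
  )
}

# bind the single-row data frames into one data frame, one row per file
df_loop <- do.call(rbind, rows)

out_file <- file.path(dir_demo, "articles.csv")
write.csv(df_loop, out_file, row.names = FALSE)
```

With your real data you would replace `dir_demo` with the folder that holds your 1100 files.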
    
    

    Each of the 3 example files looked more or less like the following:

    Nice News
    2021-01-01
    ressort
    Where does it come from?
    Contrary to popular belief, Lorem Ipsum is not simply random text.
    It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.
    Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
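
If you want to try the approach without your own files, the following sketch writes the first lines of the sample above to a temporary folder and reads them back with `lapply`. The helper name `read_article`, the extra `File` column, and the warning for files without text are my own additions, not part of the original answer, so treat them as assumptions.

```r
# First lines of the sample file above, written to a temporary folder.
sample_lines <- c(
  "Nice News",
  "2021-01-01",
  "ressort",
  "Where does it come from?",
  "Contrary to popular belief, Lorem Ipsum is not simply random text.",
  "It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old."
)
demo_dir <- file.path(tempdir(), "newspaper_sample_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(sample_lines, file.path(demo_dir, "nice_news.txt"))

# Hypothetical helper: one file in, one single-row data frame out.
read_article <- function(file) {
  lines <- readLines(file, encoding = "UTF-8")
  if (length(lines) < 5) {
    warning("No text after the metadata in ", basename(file))
  }
  data.frame(
    File      = basename(file),  # keeps track of where each row came from
    Newspaper = lines[1],
    Date      = lines[2],
    Ressort   = lines[3],
    Title     = lines[4],
    Text      = paste(lines[-(1:4)], collapse = " "),
    stringsAsFactors = FALSE
  )
}

files <- list.files(demo_dir, pattern = "\\.txt$", full.names = TRUE)

# lapply returns a list of single-row data frames; rbind stacks them
df_sample <- do.call(rbind, lapply(files, read_article))
str(df_sample)
```

Because the reader is an ordinary function, the same `read_article` can be passed to `map_dfr` instead of `lapply` if you prefer the {purrr} version.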