
Combining .txt files with character data into a data frame for tidytext analysis


I have a bunch of .txt files of job descriptions that I want to import for text mining analyses.

Here are some sample text files: https://sample-videos.com/download-sample-text-file.php. Please use the 10kb and 20kb versions, because the job descriptions are different lengths.

After combining them, I would like to do tidy text analyses and create document term matrices.

What I have done thus far:

file_list <- list.files(pattern="*.txt")
list_of_files <- lapply(file_list, read.delim)
mm <- merge_all(list_of_files) # this line doesn't work because the column headers of the lists are different
## Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

I would appreciate an answer that either helps me merge these lists into a data frame OR tells me a better way to import these text files OR sheds light on how to do tidy text analysis on lists rather than data frames.

Thanks!


Solution

  • One approach is to use the dplyr package with a for loop: import each file, combine them into a single data frame indexed by filename and paragraph number, then tidy it with tidytext:

    #install.packages(c("dplyr", "tidytext"))
    library(dplyr)
    library(tidytext)
    
    file_list <- list.files(pattern="\\.txt$") # the pattern argument is a regex, so escape the dot and anchor the extension
    
    texts <- data.frame(file=character(),
                        paragraph=numeric(), # note: as.numeric() with no argument is an error; numeric() gives an empty numeric vector
                        text=character(),
                        stringsAsFactors = FALSE) # creates empty dataframe
    
    for (i in seq_along(file_list)) {
      p <- read.delim(file_list[i],
                      header=FALSE,
                      col.names = "text",
                      stringsAsFactors = FALSE) # read.delim here is automatically splitting by paragraph
      p <- p %>% mutate(file=sub("\\.txt$", "", x=file_list[i]), # strip the extension to use the filename as a label
                        paragraph=row_number()) # add paragraph number
      texts <- bind_rows(texts, p) # adds to existing dataframe
    }
    
    words <- texts %>% unnest_tokens(word, text) # creates dataframe with one word per row, indexed
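If you'd rather avoid growing the data frame inside a loop, the same import can be sketched with `lapply()` plus a single `bind_rows()` call; this assumes the same file layout as above:

```r
library(dplyr)

file_list <- list.files(pattern = "\\.txt$")

# read each file into a one-column data frame, label it, then stack them all
texts <- bind_rows(lapply(file_list, function(f) {
  read.delim(f, header = FALSE, col.names = "text",
             stringsAsFactors = FALSE) %>%
    mutate(file = sub("\\.txt$", "", f),  # filename as label
           paragraph = row_number())      # paragraph number within file
}))
```

This builds all the per-file data frames first and binds them once at the end, which scales better when you have many job description files.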
    

    Your final output would then be:

    head(words)
                       file paragraph        word
    1   SampleTextFile_10kb         1       lorem
    1.1 SampleTextFile_10kb         1       ipsum
    1.2 SampleTextFile_10kb         1       dolor
    1.3 SampleTextFile_10kb         1         sit
    1.4 SampleTextFile_10kb         1        amet
    1.5 SampleTextFile_10kb         1 consectetur
    ...
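Since you mentioned document term matrices: once you have the one-word-per-row `words` data frame, tidytext's `cast_dtm()` will build one for you (it returns a `DocumentTermMatrix` from the tm package, so tm must also be installed). A sketch, using a small stand-in for `words`:

```r
library(dplyr)
library(tidytext)

# stand-in for the `words` data frame built above (hypothetical sample rows)
words <- data.frame(
  file = c("SampleTextFile_10kb", "SampleTextFile_10kb", "SampleTextFile_20kb"),
  paragraph = c(1, 1, 1),
  word = c("lorem", "ipsum", "lorem"),
  stringsAsFactors = FALSE
)

word_counts <- words %>%
  count(file, word)             # term frequency per document

dtm <- word_counts %>%
  cast_dtm(file, word, n)       # one row per file, one column per word
```

From there `dtm` can go straight into packages that expect a tm-style document-term matrix.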
    

    Is this what you're looking for in the next stages of your analysis?