
Combining .txt files with character data into a data frame for tidytext analysis


I have a bunch of .txt files of job descriptions that I want to import for text mining analyses.

Here are some sample text files: https://sample-videos.com/download-sample-text-file.php. Please use the 10kb and 20kb versions, because the job descriptions are different lengths.

After combining them, I would like to do tidy text analyses and create document term matrices.

What I have done thus far:

file_list <- list.files(pattern="*.txt")
list_of_files <- lapply(file_list, read.delim)
mm <- merge_all(list_of_files) # this line doesn't work because the column headers of the lists are different
## Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

I would appreciate an answer that either helps me merge these lists into a data frame OR tells me a better way to import these text files OR sheds light on how to do tidy text analysis on lists rather than data frames.

Thanks!


Solution

  • One approach is to use the dplyr package with a for loop: import each file, combine them into a single data frame indexed by filename and paragraph number, then tidy it with tidytext:

    #install.packages(c("dplyr", "tidytext"))
    library(dplyr)
    library(tidytext)
    
    file_list <- list.files(pattern="\\.txt$") # the pattern argument is a regex, so escape the dot and anchor the extension
    
    texts <- data.frame(file=character(),
                        paragraph=numeric(), # note: as.numeric() with no argument is an error; numeric() gives an empty numeric vector
                        text=character(),
                        stringsAsFactors = FALSE) # creates empty dataframe
    
    for (i in seq_along(file_list)) {
      p <- read.delim(file_list[i],
                      header=FALSE,
                      col.names = "text",
                      stringsAsFactors = FALSE) # read.delim here is automatically splitting by paragraph
      p <- p %>% mutate(file=sub("\\.txt$", "", x=file_list[i]), # strip the extension to use the filename as a label
                        paragraph=row_number()) # add paragraph number
      texts <- bind_rows(texts, p) # adds to existing dataframe
    }
    
    words <- texts %>% unnest_tokens(word, text) # creates dataframe with one word per row, indexed
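If you'd rather avoid growing the data frame inside a loop, the same import can be sketched with `lapply()` plus a single `bind_rows()` call; this assumes the same file layout as above:

```r
library(dplyr)

file_list <- list.files(pattern = "\\.txt$")

# read each file into a one-column data frame, label it, then stack them all
texts <- bind_rows(lapply(file_list, function(f) {
  read.delim(f, header = FALSE, col.names = "text",
             stringsAsFactors = FALSE) %>%
    mutate(file = sub("\\.txt$", "", f),  # filename as label
           paragraph = row_number())      # paragraph number within file
}))
```

This builds all the per-file data frames first and binds them once at the end, which scales better when you have many job description files.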
    

    Your final output would then be:

    head(words)
                       file paragraph        word
    1   SampleTextFile_10kb         1       lorem
    1.1 SampleTextFile_10kb         1       ipsum
    1.2 SampleTextFile_10kb         1       dolor
    1.3 SampleTextFile_10kb         1         sit
    1.4 SampleTextFile_10kb         1        amet
    1.5 SampleTextFile_10kb         1 consectetur
    ...
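Since you mentioned document term matrices: once you have the one-word-per-row `words` data frame, tidytext's `cast_dtm()` will build one for you (it returns a `DocumentTermMatrix` from the tm package, so tm must also be installed). A sketch, using a small stand-in for `words`:

```r
library(dplyr)
library(tidytext)

# stand-in for the `words` data frame built above (hypothetical sample rows)
words <- data.frame(
  file = c("SampleTextFile_10kb", "SampleTextFile_10kb", "SampleTextFile_20kb"),
  paragraph = c(1, 1, 1),
  word = c("lorem", "ipsum", "lorem"),
  stringsAsFactors = FALSE
)

word_counts <- words %>%
  count(file, word)             # term frequency per document

dtm <- word_counts %>%
  cast_dtm(file, word, n)       # one row per file, one column per word
```

From there `dtm` can go straight into packages that expect a tm-style document-term matrix.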
    

    Is this what you're looking for in the next stages of your analysis?