I have a bunch of .txt files containing job descriptions, and I want to import them for text mining analysis.
Some sample text files are available here: https://sample-videos.com/download-sample-text-file.php. Please use the 10kb and 20kb versions, because my job descriptions vary in length.
After combining them, I would like to do tidy text analyses and create document term matrices.
What I have done thus far:
file_list <- list.files(pattern="*.txt")
list_of_files <- lapply(file_list, read.delim)
mm<- merge_all(list_of_files) # this line doesn't work because the column headers of the lists are different
## Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
I would appreciate an answer that either helps me merge these lists into a data frame OR tells me a better way to import these text files OR sheds light on how to do tidy text analysis on lists rather than data frames.
Thanks!
One approach could be to use the dplyr package and a for loop to import each file and combine them into a single data frame, indexed by filename and paragraph number, and then use tidytext to tidy it up:
#install.packages(c("dplyr", "tidytext"))
library(dplyr)
library(tidytext)
file_list <- list.files(pattern = "\\.txt$") # pattern is a regular expression, not a glob
texts <- data.frame(file = character(),
                    paragraph = integer(),
                    text = character(),
                    stringsAsFactors = FALSE) # creates empty dataframe
for (i in seq_along(file_list)) {
  p <- read.delim(file_list[i],
                  header = FALSE,
                  col.names = "text",
                  quote = "", # treat quotation marks in the text as ordinary characters
                  stringsAsFactors = FALSE) # one row per line; in the sample files each paragraph is a single line
  p <- p %>% mutate(file = sub("\\.txt$", "", file_list[i]), # add filename as label
                    paragraph = row_number()) # add paragraph number
  texts <- bind_rows(texts, p) # adds to existing dataframe
}
words <- texts %>% unnest_tokens(word, text) # creates dataframe with one word per row, indexed
Your final output would then be:
head(words)
                   file paragraph        word
1   SampleTextFile_10kb         1       lorem
1.1 SampleTextFile_10kb         1       ipsum
1.2 SampleTextFile_10kb         1       dolor
1.3 SampleTextFile_10kb         1         sit
1.4 SampleTextFile_10kb         1        amet
1.5 SampleTextFile_10kb         1 consectetur
...
Is this what you're looking for for your next stages of analysis?
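If the next stage is the document term matrix mentioned in the question, a minimal sketch could look like the following. It assumes the tm package is installed (tidytext's cast_dtm() relies on it), and the small texts data frame here is a hypothetical stand-in with the same columns as the one built by the loop, so the snippet runs on its own. Removing stop words is optional and shown only as a common step.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the `texts` data frame (file, paragraph, text)
texts <- data.frame(
  file = c("job_a", "job_a", "job_b"),
  paragraph = c(1L, 2L, 1L),
  text = c("Analyse data and build models",
           "Present results to the team",
           "Build data pipelines"),
  stringsAsFactors = FALSE
)

word_counts <- texts %>%
  unnest_tokens(word, text) %>%            # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%   # optional: drop common English stop words
  count(file, word)                        # term frequency per document, stored in column n

# One row per file, one column per term; requires the tm package
dtm <- word_counts %>%
  cast_dtm(document = file, term = word, value = n)

dtm
```

Here each file is treated as one document. If you would rather treat each paragraph as a document, you could count by a combined file and paragraph label instead.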