I am trying to write a Ruby script that processes a collection of unstructured plain text files, and I am struggling to work out the best way to handle them. The current working version of my topic-modeling script is the following:
#!/usr/bin/env ruby -w
require 'rubygems'
require 'lda-ruby'
# Input a directory of files
FILES_DIRECTORY = ARGV[0]
File.open("files.csv", "w") do |f|
  Dir.glob(FILES_DIRECTORY + "*.txt") do |filename|
    file_id = File.basename(filename).gsub(".txt", "")
    text = File.read(filename).clean
    f.puts [file_id, text].join(",")
  end
end
# Read csv
file = File.open("files.csv", "r") { |f| f.read }
# Train topics and infer
corpus = Lda::Corpus.new
corpus.add_document(Lda::TextDocument.new(corpus, file))
lda = Lda::Lda.new(corpus)
lda.verbose = false
lda.num_topics = 20
lda.em('random')
topics = lda.top_words(10)
puts topics
What I'm trying to change is to have this program read through a collection of plain text files rather than a single file. It's not as easy as just tossing all the text files into a single file (as it currently does with files.csv) because, as I understand it, lda-ruby looks for multiple files to do a correct topic model rather than a single file. (I've come to this conclusion because there is little variance between having this script read a single text file [e.g., corpus.txt] that includes all the text and having it read the files.csv file.)
So, my question is how can I have lda-ruby iterate through these text files differently? Should the contents of the files be placed into a hash instead? If so, any pointers on where I should start with that? Or, should I scrap this and use a different LDA library?
Thanks ahead of time for any advice.
Basically, you just need to initialize the corpus before going through the directory, and then add each file to the corpus as its own document inside the Dir.glob block, the same way you were previously adding your CSV file.
#!/usr/bin/env ruby -w
require 'rubygems'
require 'lda-ruby'
# Input a directory of files
FILES_DIRECTORY = ARGV[0]
corpus = Lda::Corpus.new
# Add each text file to the corpus as its own document
# (File.join works whether or not the directory argument ends in a slash)
Dir.glob(File.join(FILES_DIRECTORY, "*.txt")) do |filename|
  text = File.read(filename)
  corpus.add_document(Lda::TextDocument.new(corpus, text))
end
lda = Lda::Lda.new(corpus)
lda.verbose = false
lda.num_topics = 20
lda.em('random')
topics = lda.top_words(10)
puts topics
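As for the hash you asked about: you don't need one for lda-ruby itself, but it can be handy if you want to keep track of which file each document came from. Here is a rough sketch using only the standard library. Note that read_documents is just a name I made up for illustration (it is not part of lda-ruby), and I'm assuming the file name without its .txt extension is a good enough ID.

```ruby
# Build a hash of { file_id => text } from every .txt file in a directory.
# read_documents is a hypothetical helper, not part of lda-ruby.
def read_documents(directory)
  # File.join copes with a directory given with or without a trailing slash
  pattern = File.join(directory, "*.txt")

  # Sort so the order of documents is the same on every run
  Dir.glob(pattern).sort.each_with_object({}) do |filename, docs|
    file_id = File.basename(filename, ".txt") # strip directory and extension
    docs[file_id] = File.read(filename)
  end
end

# Possible usage with the corpus from the script above (needs the lda-ruby gem):
#   documents = read_documents(ARGV[0])
#   documents.each_value do |text|
#     corpus.add_document(Lda::TextDocument.new(corpus, text))
#   end
```

Since Ruby hashes keep insertion order, documents.keys should line up with the order in which the documents were added to the corpus, which may help if you later want to map results back to file names.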
I know this is a rather old question, but I found it while looking for a solution to a similar problem. Your code helped me, so I thought my answer might be helpful to you or others.