I have a folder that contains many .txt documents of tourism reviews. I want to use the bag-of-words approach to convert them into some kind of numeric representation for machine learning (Latent Dirichlet Allocation - LDA) in C++, so I can train the system to recognize the topic of each document.
But I do not know what to do with the bag-of-words algorithm. I have heard of tools like Scikit-learn, but Scikit-learn works in a Python environment. Are there any recommended tools or libraries that can help me build my bag-of-words module? Or is there a C++ wrapper around scikit-learn?
I have reached a point where I don't know what to do, so some guidance would be appreciated. Thank you :)
Umm... surely it should be easy enough to code?
The stupidest, yet guaranteed-to-work, approach is to iterate over all the documents twice. During the first pass, build a hashmap from each word to a unique index (e.g. a `std::unordered_map<std::string, int>`), and during the second pass, look each word up in that table and emit its index to produce a numeric representation of the data.
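A minimal sketch of that first pass, assuming plain whitespace tokenization is good enough for your reviews (lowercasing and punctuation stripping are left out, and `documentPaths` is just a hypothetical list of your .txt file names):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// First pass: assign every distinct word a unique integer index.
std::unordered_map<std::string, int> buildVocabulary(
        const std::vector<std::string>& documentPaths) {
    std::unordered_map<std::string, int> vocab;
    for (const auto& path : documentPaths) {
        std::ifstream in(path);
        std::string word;
        while (in >> word) {  // whitespace tokenization
            if (vocab.find(word) == vocab.end())
                vocab.emplace(word, static_cast<int>(vocab.size()));
        }
    }
    return vocab;
}
```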
If you want a bag-of-words representation, then during the second pass you create a fresh hashmap every time you start a new document, increment the count for each word's index as you read it, and once you reach the end of the document, read out the counts and print them.
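A sketch of that second pass, reusing the vocabulary built above (words that never appeared in the first pass are simply skipped):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// Second pass: for one document, count occurrences of each word index.
std::unordered_map<int, int> bagOfWords(
        const std::string& path,
        const std::unordered_map<std::string, int>& vocab) {
    std::unordered_map<int, int> counts;
    std::ifstream in(path);
    std::string word;
    while (in >> word) {
        auto it = vocab.find(word);
        if (it != vocab.end())
            ++counts[it->second];  // increment this word-index's count
    }
    return counts;  // sparse bag-of-words vector for this document
}
```

The returned map is a sparse count vector; if your LDA implementation expects dense input instead, expand it into a `std::vector<int>` of size `vocab.size()` with zeros for the missing indices.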