I have a folder that contains many .txt documents of tourism reviews. I want to use the bag-of-words approach to convert them into some kind of numeric representation for machine learning (Latent Dirichlet Allocation - LDA) in C++, so I can train the system to recognize the topic of each document.
But I do not know what to do with the bag-of-words algorithm. I have heard of tools like Scikit-learn, but Scikit-learn works in a Python environment. Are there any recommended tools or libraries that can help me build my bag-of-words module? Or is there a C++ wrapper around scikit-learn?
I have reached a point where I don't know what to do, so some guidance would be appreciated. Thank you :)
Umm... surely it should be easy enough to code?
The stupidest, yet guaranteed-to-work, approach is to iterate over all the documents twice. During the first pass, build a hashmap from each word to a unique index (e.g. a `std::unordered_map<std::string, int>`), and during the second pass, look each word up in that table and emit its index to produce a numeric representation of the data.
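A minimal sketch of that first pass, assuming plain whitespace tokenization is good enough for your reviews (lowercasing and punctuation stripping are left out, and `documentPaths` is just a hypothetical list of your .txt file names):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>
#include <vector>

// First pass: assign every distinct word a unique integer index.
std::unordered_map<std::string, int> buildVocabulary(
        const std::vector<std::string>& documentPaths) {
    std::unordered_map<std::string, int> vocab;
    for (const auto& path : documentPaths) {
        std::ifstream in(path);
        std::string word;
        while (in >> word) {  // whitespace tokenization
            if (vocab.find(word) == vocab.end())
                vocab.emplace(word, static_cast<int>(vocab.size()));
        }
    }
    return vocab;
}
```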
If you want a bag-of-words representation, then during the second pass you create a fresh hashmap every time you start a new document, increment the count for each word's index as you read it, and once you reach the end of the document, read out the counts and print them.
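A sketch of that second pass, reusing the vocabulary built above (words that never appeared in the first pass are simply skipped):

```cpp
#include <fstream>
#include <string>
#include <unordered_map>

// Second pass: for one document, count occurrences of each word index.
std::unordered_map<int, int> bagOfWords(
        const std::string& path,
        const std::unordered_map<std::string, int>& vocab) {
    std::unordered_map<int, int> counts;
    std::ifstream in(path);
    std::string word;
    while (in >> word) {
        auto it = vocab.find(word);
        if (it != vocab.end())
            ++counts[it->second];  // increment this word-index's count
    }
    return counts;  // sparse bag-of-words vector for this document
}
```

The returned map is a sparse count vector; if your LDA implementation expects dense input instead, expand it into a `std::vector<int>` of size `vocab.size()` with zeros for the missing indices.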