I have used gensim.utils.simple_preprocess(str(sentence) to create a dictionary of words that I want to use for topic modelling. However, this is also filtering important numbers (house resolutions, bill no, etc) that I really need. How did I overcome this? Possibly by replacing digits with their word form. How do i go about it, though?
You don't have to use simple_preprocess()
- it's not doing much, it's not that configurable or sophisticated, and typically the other Gensim algorithms just need lists-of-tokens.
So, choose your own tokenization - which in some cases, depnding on your source data, could be as simple as a .split()
on whitespace.
If you want to look at what simple_preprocess()
does, as a model, you can view its Python source at: