When I inspected the resulting document-term matrices, I discovered that tokens were lowercased unless the corresponding control option was set to FALSE. Furthermore, words containing underscores were split before tokenization.
I could not find the default settings in the documentation, nor which settings are used when no explicit control is provided.
Where can I find this?
The documentation for DocumentTermMatrix says "see termFreq for available local control options."
If you do:
?termFreq
you'll see all the possible options with their defaults (including the "Defaults to tolower" you are referring to).