r tokenize tm

What are the default control settings for the function DocumentTermMatrix in the tm package?


When I inspected the resulting document-term matrices, I found that tokens were lowercased unless the lowercasing option was explicitly set to FALSE. In addition, words containing underscores were split during tokenization.

When I looked at the documentation, I could not find the default settings, that is, which settings are used when no explicit control list is provided.

Where can I find this?


Solution

  • The documentation for DocumentTermMatrix says "see termFreq for available local control options."

    If you do:

    ?termFreq
    

    you'll see all the possible options along with their defaults (including the "Defaults to tolower" behavior you are referring to).
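
To see the effect of the defaults in practice, here is a minimal sketch comparing a call without a control list to one that overrides two of the options documented in ?termFreq (tolower and wordLengths). The sample texts and the variable names docs, corpus, dtm_default and dtm_explicit are made up for illustration, and the exact terms you get may depend on your tm version and corpus type.

```r
library(tm)

# Two tiny example documents (hypothetical data, for illustration only)
docs <- c("Hello World_Wide Web", "hello again")
corpus <- VCorpus(VectorSource(docs))

# No control list: the defaults described in ?termFreq apply,
# so terms are lowercased
dtm_default <- DocumentTermMatrix(corpus)

# Explicit control list: keep the original case and
# allow terms of any length
dtm_explicit <- DocumentTermMatrix(
  corpus,
  control = list(tolower = FALSE, wordLengths = c(1, Inf))
)

# Compare the resulting vocabularies
print(Terms(dtm_default))
print(Terms(dtm_explicit))
```

Comparing the two term lists side by side is a quick way to check which defaults are in effect for your own corpus.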