
How to change column names to comply with mlr3's naming convention


I want to perform text classification with many (>50K) tokens as feature names. However, the Task() functions in mlr3 reject many characters in column names, even though those names pass make.names() and are otherwise fine. Here are the replacements I have found necessary so far:

    mutate(token = str_replace(token, "à", "a")) %>%
    mutate(token = str_replace(token, "ã", "a")) %>%
    mutate(token = str_replace(token, "á", "a")) %>%
    mutate(token = str_replace(token, "ø", "o")) %>%
    mutate(token = str_replace(token, "ç", "c")) %>%
    mutate(token = str_replace(token, "ô", "o")) %>%
    mutate(token = str_replace(token, "é", "e")) %>%
    mutate(token = str_replace(token, "é", "e")) %>%
    mutate(token = str_replace(token, "í", "i")) %>%
    mutate(token = str_replace(token, "î", "i")) %>%
    mutate(token = str_replace(token, "è", "e")) %>%
    mutate(token = str_replace(token, "ë", "e")) %>%
    mutate(token = str_replace(token, "å", "a")) %>%
    mutate(token = str_replace(token, "â", "a")) %>%
    mutate(token = str_replace(token, "æ", "a")) %>%
    mutate(token = str_replace(token, "ñ", "n"))

How do I make my data.frame compatible with mlr3 without manually replacing every special character this way (by trial and error)? make.names() evidently does not do the job!

I would very much appreciate some help :) Thanks!


Solution

  • One way to do it is to use janitor::clean_names():

    d <- data.frame(`süßigkeit` = 1:3, `straße` = 1:3, `Hellö` = 1:3, `séé` = 1:3)
    janitor::clean_names(d)
    #>   sussigkeit strasse hello see
    #> 1          1       1     1   1
    #> 2          2       2     2   2
    #> 3          3       3     3   3
    

    Created on 2021-01-11 by the reprex package (v0.3.0)
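
    To check that the cleaned names are accepted, the data can be passed to a task constructor. The following is only a sketch: the `label` column, the task id "sweets" and the use of `TaskClassif$new()` are assumptions made for illustration, not part of the original question.

    ```r
    library(mlr3)

    # Hypothetical toy data: two non-ASCII feature names plus a made-up target column
    d <- data.frame(`süßigkeit` = 1:3, `straße` = 1:3)
    d$label <- factor(c("a", "b", "a"))

    # Transliterate the column names to ASCII snake_case
    d_clean <- janitor::clean_names(d)

    # Build a classification task from the cleaned data
    task <- TaskClassif$new(id = "sweets", backend = d_clean, target = "label")
    task$feature_names
    ```

    If the constructor still complains, the error message should name the offending columns, which helps to spot names that the transliteration did not cover.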

    If you're processing a character vector rather than the names of a data.frame, you could use the underlying function janitor::make_clean_names():

    janitor::make_clean_names("süßigkeit")
    #> [1] "sussigkeit"
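
    One caveat for a long-format token column like the one in the question: make_clean_names() is designed to return unique names, so calling it directly on a column in which the same token occurs in many rows may append numeric suffixes to the repeats. A possible workaround, sketched below with a hypothetical `tokens` table, is to clean each distinct token once and look the result up per row.

    ```r
    library(janitor)

    # Hypothetical long-format table: one row per document/token pair
    tokens <- data.frame(
      doc   = c(1L, 1L, 2L, 2L),
      token = c("süßigkeit", "straße", "straße", "séé"),
      stringsAsFactors = FALSE
    )

    # Clean every distinct token exactly once
    distinct_tokens <- unique(tokens$token)
    cleaned <- make_clean_names(distinct_tokens)

    # Map each row back to its cleaned token by position in the distinct set
    tokens$token_clean <- cleaned[match(tokens$token, distinct_tokens)]
    tokens
    ```

    Distinct tokens that collapse to the same ASCII form would still be kept apart by a suffix, so it is worth deciding whether such tokens should be merged into one feature before building the task.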