r, special-characters, data-cleaning

What's the best way to work with datasets that contain special characters in their column names in R?


I am working with some large datasets that contain special characters in their column names. The column names look something like: "@c_age1619_da * ((df.age >= 16) & (df.age <= 19))" or "sovtoll_available == False". What would be the best way to work with these names? Should I keep the names as they are or rename them to more R-friendly names? When I call them in cases like df$value, R mistakenly interprets the column name as a function!


Solution

  • The only real advantage of keeping the non-standard names is that they can double as labels in a plot or table. They make the data much harder to work with, though, and the original names can always be reintroduced as labels later. If you do keep them, you can refer to a non-standard name by wrapping it in backticks, e.g.,

    df$`@c_age1619_da`
    

    Some editors (like RStudio) will correctly auto-complete these non-standard names, making them somewhat easier to work with, but still not as nice as standard names.

    Renaming them to standard names is generally better. Many functions that read in data do this automatically (read.csv, for example, does so unless you pass check.names = FALSE). You can use the make.names function to convert non-standard names to syntactically valid ones, mostly by replacing each invalid character with a dot (.). Like this:

    names(my_data) <- make.names(names(my_data))
    

    But generally the best approach is to choose meaningful names manually. sovtoll_available....False isn't a very friendly name either, compared to something like sovtoll_unavailable.
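
To tie the options above together, here is a minimal sketch in base R. The toy data frame and its values are made up for illustration, the short names age1619_da and sovtoll_unavailable are hypothetical choices, and original_names, auto_names and labels are just helper variables introduced here. It shows backtick access, what make.names produces, and one way to keep the original names around as labels after a manual rename.

```r
# Toy data frame using the two column names quoted in the question.
# check.names = FALSE stops data.frame() from sanitising them on creation.
df <- data.frame(
  `@c_age1619_da * ((df.age >= 16) & (df.age <= 19))` = c(0.5, 0.0, 1.2),
  `sovtoll_available == False` = c(TRUE, FALSE, TRUE),
  check.names = FALSE
)

# Option 1: keep the original names and quote them when needed.
df$`sovtoll_available == False`
df[["sovtoll_available == False"]]  # indexing by string needs no backticks

# Save the original names so they can be reused as labels later.
original_names <- names(df)

# Option 2: automatic clean-up. unique = TRUE guards against two
# different originals collapsing to the same sanitised name.
auto_names <- make.names(original_names, unique = TRUE)
print(auto_names)

# Option 3: manual renaming (these short names are hypothetical choices).
names(df) <- c("age1619_da", "sovtoll_unavailable")

# Lookup table from clean name back to the original label,
# e.g. for axis titles or table headers.
labels <- setNames(original_names, names(df))
labels[["sovtoll_unavailable"]]
```

Keeping the lookup vector means the clean names can be used throughout the analysis, while the original strings are still one index away when a plot or table needs them.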