r variables special-characters tibble superscript

Using special characters such as superscripts in tibble variable names


A tibble can accept variable names with some special characters. This is a simple example:

library(tibble)
df <- tibble(
  `Age, years` = c(25, 26, 27, 29), 
  `BMI, kg/m^2` = c(21, 23, 24, 25)
)

This is convenient for generating tables and plots. However, I have not been able to work out how to use a true superscript in a name such as BMI, kg/m^2 directly. I can add a label with expression(paste("BMI, kg/m"^"2")) or quote("BMI, kg/m"^"2") in ggplot to show the superscript in graphs, but I feel that adding it directly to the variable names would be more convenient for producing both graphs and tables. Is this possible? Thanks.


Solution

  • Including special characters in symbol and column names is generally a bad idea. Think instead in terms of data manipulation versus presentation. For data manipulation, you can make do with age and bmi, which leaves you full flexibility to format the labels at the presentation stage.
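
    As a minimal sketch of that separation, the display labels can live in a lookup vector next to the data rather than in the column names. The names `labels` and `presentation` are made up for illustration, and a plain data.frame is used to keep the example free of dependencies; the same idea should carry over to a tibble.

    ```r
    # Short, syntactic column names for the analysis itself
    df <- data.frame(
      age = c(25, 26, 27, 29),
      bmi = c(21, 23, 24, 25)
    )

    # Hypothetical lookup vector: display labels are kept apart from the data.
    # "\u00b2" is the Unicode superscript two; whether it renders correctly
    # still depends on the output medium and encoding.
    labels <- c(age = "Age, years", bmi = "BMI, kg/m\u00b2")

    # Apply the labels only to a copy used for the final table
    presentation <- df
    names(presentation) <- unname(labels[names(df)])
    ```

    The analysis keeps using df$age and df$bmi, and only the copy that is handed to the table output carries the long names.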

    Special characters and "true superscripts" are not straightforward, and how they behave depends entirely on the medium you are working with. A superscript 2 (²) may or may not be rendered correctly in HTML, PDF and plots, depending on the encoding used (example).

    If you need to output a superscript 2 in, e.g., a plot, you could use plot(1,1, ylab=expression(kg/m^2)), but it takes some extra steps to include spaces. What happens if you need to break the label and units into, e.g., two lines? You wouldn't want to change the naming of your data structure to reflect a label in the plot.
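
    One way to handle the spaces, sketched here with base graphics, is to put the text part of the label in a quoted string inside the plotmath expression. The list name `axis_labels` is an assumption for illustration, and pdf(NULL) is only there so the sketch runs without opening a graphics window.

    ```r
    df <- data.frame(age = c(25, 26, 27, 29), bmi = c(21, 23, 24, 25))

    # Hypothetical named list of plotmath labels; quoted strings keep
    # spaces and commas intact, and ^2 is drawn as a superscript
    axis_labels <- list(
      age = expression("Age, years"),
      bmi = expression("BMI, kg/m"^2)
    )

    pdf(NULL)  # null graphics device, nothing is written to disk
    plot(df$age, df$bmi, xlab = axis_labels$age, ylab = axis_labels$bmi)
    invisible(dev.off())
    ```

    The column names stay age and bmi throughout; only the label objects know about the superscript.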

    HTML? It might be safer to use HTML entities (&sup2;), but you might need to handle escaping in whatever is parsing your output. Again, separate the label from the data structure.
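
    For the HTML case, a rough sketch could build the header row from a separate label vector that uses the entity. The names `html_labels` and `header_row` are invented here, and a real table package may escape the ampersand unless told not to.

    ```r
    # Hypothetical label vector for HTML output, keyed by the short column names
    html_labels <- c(age = "Age, years", bmi = "BMI, kg/m&sup2;")

    # Wrap each label in a header cell and join the cells into one row
    header_cells <- paste0("<th>", html_labels[c("age", "bmi")], "</th>", collapse = "")
    header_row <- paste0("<tr>", header_cells, "</tr>")
    ```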

    PDF via LaTeX? Use $\frac{kg}{m^2}$. Again, separate the presentation from the variable names in the data structure.

    Having the labels directly encoded in the variable names is generally a bad idea. If you are not convinced yet, suppose you did name your column "BMI, kg/m"^"2" and then used quote whenever you needed to pass it. Throughout your entire data analysis, you would have to type out that name to refer to that column. Need to format the number of digits?

    df$`"BMI, kg/m"^"2"` <- format(df$`"BMI, kg/m"^"2"`, digits=1)
    etc. etc. etc.
    

    And then, after you've finished your report, your boss tells you to present the units in a one-line fashion, i.e. kg•m^-2.
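
    With the labels kept apart from the data, that request becomes a one-line change. The following is only a sketch: the label object names are hypothetical, the BMI values have made-up decimals so the rounding is visible, and round() stands in for the formatting step above.

    ```r
    df <- data.frame(age = c(25, 26, 27, 29), bmi = c(21.4, 23.1, 24.8, 25.2))

    # Analysis code refers only to the short column name
    df$bmi <- round(df$bmi, digits = 0)

    # Hypothetical label definitions; switching notation means choosing
    # a different label, with no change to the data or the analysis code
    bmi_label_fraction <- expression("BMI, kg/m"^2)
    bmi_label_one_line <- expression("BMI, kg" %.% m^-2)  # %.% draws a centred dot

    pdf(NULL)  # null graphics device, nothing is written to disk
    plot(df$age, df$bmi, xlab = "Age, years", ylab = bmi_label_one_line)
    invisible(dev.off())
    ```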