How to deal with ggplot2 and overlapping labels on a discrete axis

ggplot2 does not seem to have a built-in way of dealing with overplotting for text on scatter plots. However, I have a different situation where the labels are those on a discrete axis and I'm wondering if someone here has a better solution than what I've been doing.

Some example code:

library(ggplot2)

#some example data
test.data = data.frame(text = c("A full commitment's what I'm thinking of",
                                "History quickly crashing through your veins",
                                "And I take A deep breath and I get real high",
                                "And again, the Internet is not something that you just dump something on. It's not a big truck."),
                       mean = c(3.5, 3, 5, 4),
                       CI.lower = c(3, 2.5, 4.5, 3.5),
                       CI.upper = c(4, 3.5, 5.5, 4.5))

#plot
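#factor(text, levels = text) keeps the data order on the axis, so the
#labels passed to scale_x_discrete() line up with the right points
ggplot(test.data, aes(x = factor(text, levels = text), y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymax = CI.upper, ymin = CI.lower), width = .1) +
  scale_x_discrete(labels = test.data$text, name = "")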

[image: the resulting plot, with overlapping x-axis labels]

So we see that the x-axis labels are on top of each other. Two solutions spring to mind: 1) abbreviating the labels, and 2) adding newlines to the labels. In many cases (1) will do, but in some cases it cannot be done. So I wrote a function that inserts a newline (\n) after every n characters of a string to avoid overlapping names:

library(ggplot2)

#Inserts newlines into strings every N interval
new_lines_adder = function(test.string, interval){
  #length of str
  string.length = nchar(test.string)
  #split by N char intervals
  split.starts = seq(1,string.length,interval)
  split.ends = c(split.starts[-1]-1,nchar(test.string))
  #split it
  test.string = substring(test.string, split.starts, split.ends)
  #put it back together with newlines
  test.string = paste0(test.string,collapse = "\n")
  return(test.string)
}

#a user-level wrapper that also works on character vectors, data.frames, matrices and factors
add_newlines = function(x, interval) {
  #flatten data.frames to a vector; class() returns c("matrix", "array") for
  #matrices in R >= 4.0, so comparing it with == inside if() is unsafe
  if (is.data.frame(x)) {
    x = unlist(x, use.names = FALSE)
  }
  #matrices and factors become plain character vectors here
  x = as.character(x)

  #apply splitter to each element, without the names sapply() would add
  sapply(x, FUN = new_lines_adder, interval = interval, USE.NAMES = FALSE)
}

#plot again
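#factor(text, levels = text) keeps the data order on the axis, so the
#wrapped labels line up with the right points
ggplot(test.data, aes(x = factor(text, levels = text), y = mean)) +
  geom_point() +
  geom_errorbar(aes(ymax = CI.upper, ymin = CI.lower), width = .1) +
  scale_x_discrete(labels = add_newlines(test.data$text, 20), name = "")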

And the output is:

[image: the same plot, with the x-axis labels broken onto several lines]

Then one can spend some time playing with the interval size to avoid having too much white-space between labels.

If the number of labels varies, this kind of solution is not so good, as the optimal interval size changes. Also, because the normal font is not mono-spaced, the text of the labels has an effect on the width too, and so one has to take extra care in selecting a good interval (one can avoid this by using a mono-spaced font, but those are extra wide). Finally, the new_lines_adder() function is stupid in that it will split words in two in silly ways humans would not. E.g. in the above it split "breath" into "br\neath". One could rewrite it to avoid this problem.
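As an illustration of splitting at word boundaries rather than mid-word, here is a minimal sketch built on base R's strwrap(). The name wrap_words is made up for this example and is not part of the code above; note also that strwrap() treats width as a target column, so the lines it produces are slightly shorter than width.

```r
# Hypothetical helper (not from the post above): wrap each string at word
# boundaries using base R's strwrap(), then rejoin the pieces with newlines.
wrap_words = function(x, width) {
  vapply(as.character(x),
         function(s) paste(strwrap(s, width = width), collapse = "\n"),
         character(1), USE.NAMES = FALSE)
}

# Words such as "breath" are kept whole
wrapped = wrap_words("And I take A deep breath and I get real high", 20)
cat(wrapped, "\n")
```

It still counts characters rather than measuring rendered width, so the interval has to be tuned by hand just like before.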

One can also decrease the font size, but this is a trade-off with readability, and decreasing the font size is often unnecessary.

What is the best way of handling this kind of label overplotting?

Solution

Building on @Stibu's answer and comment, this solution takes the number of groups into account and uses the intelligent splitting developed by Stibu, while adding a fix for words separated by a slash.

Functions:

library(stringr) #for str_replace_all()

#Inserts newlines into a string at word boundaries, roughly every `interval` characters
new_lines_adder = function(x, interval) {
  #add spaces after /
  x = str_replace_all(x, "/", "/ ")
  #split at spaces
  x.split = strsplit(x, " ")[[1]]
  # get length of snippets, add one for space
  lens <- nchar(x.split) + 1
  # now the trick: split the text into lines with
  # length of at most interval + 1 (including the spaces)
  lines <- cumsum(lens) %/% (interval + 1)
  # construct the lines
  x.lines <- tapply(x.split, lines, function(line)
    paste0(paste(line, collapse=" "), "\n"), simplify = TRUE)
  # put everything into a single string
  result <- paste(x.lines, collapse="")
  #remove spaces we added after /
  result = str_replace_all(result, "/ ", "/")
  return(result)
}

#wrapper for the above, meant for users
add_newlines = function(x, total.length = 85) {
  # make sure x is a character vector
  x = as.character(x)
  #determine number of groups
  groups = length(x)
  # apply splitter to each
  t = sapply(x, FUN = new_lines_adder, interval = round(total.length/groups), USE.NAMES=FALSE)
  return(t)
}

I tried some values for the default input and 85 is the value for which the text outcome is decent for the example data. Any higher and "veins" in label 2 gets moved up and gets too close to the third label.

Here's how it looks:

[image: the plot with labels wrapped by the functions above]

Still, it would be better to use a real measure of total text width rather than the number of characters, since relying on this proxy generally means that the labels waste a lot of space. Maybe one could rewrite new_lines_adder() with some code based on strwidth() to deal with the problem of unequal character widths.
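To make the strwidth() idea concrete, here is a rough sketch of a greedy word wrap that measures each candidate line in inches instead of counting characters. The function name wrap_by_width and the max.width parameter are invented for this example. It also assumes that widths measured on a base graphics device are a good enough stand-in for what ggplot2 actually renders, which may not hold for every font or device.

```r
# Hypothetical helper: greedy word wrap driven by measured text width
# (in inches) rather than by number of characters.
wrap_by_width = function(x, max.width, cex = 1) {
  words = strsplit(x, " ", fixed = TRUE)[[1]]
  lines = character(0)
  current = ""
  for (w in words) {
    # line we would get if we appended this word
    candidate = if (nzchar(current)) paste(current, w) else w
    too.wide = strwidth(candidate, units = "inches", cex = cex) > max.width
    if (nzchar(current) && too.wide) {
      # close the current line and start a new one with this word
      lines = c(lines, current)
      current = w
    } else {
      current = candidate
    }
  }
  paste(c(lines, current), collapse = "\n")
}

pdf(NULL)   # off-screen device, so strwidth() has something to measure on
plot.new()  # called defensively; units = "inches" may not strictly need it
wrapped = wrap_by_width("And I take A deep breath and I get real high",
                        max.width = 1.5)
dev.off()
cat(wrapped, "\n")
```

A single word wider than max.width is left on a line of its own rather than being split. Choosing max.width from the plot width and the number of groups would still be up to the caller.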

I'm leaving this question unanswered in case someone can find a way to do this.

I have added the two functions to my personal package on GitHub, so anyone who wants to use them can fetch them from there.