
Problems understanding log-log ggplots


I'm working with a very large data set (too large to post here) and I'm really struggling with creating a histogram that looks right. This was my best try with the original data:

g <- ggplot(df2, aes(x = n))
g <- g + geom_histogram(color = "white", fill = "firebrick3", bins = 47)
g <- g + scale_x_continuous(trans = 'log10', 
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format(math_format(10^.x)))
g <- g + scale_y_continuous(trans = 'log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format(math_format(10^.x)), 
        oob = squish_infinite)
g <- g + annotation_logticks()
g <- g + labs(x = "n", y = "log(Count)")
g

This did not produce a plot; instead, it threw one error and two warnings:

  • Error in x * scale : non-numeric argument to binary operator. I don't understand this, since the x values (n) are integers. The error persisted after I ran df2$n <- as.numeric(df2$n).
  • Transformation introduced infinite values in continuous y-axis. I thought this would be handled by the oob = squish_infinite argument to scale_y_continuous().
  • Removed 2 rows containing missing values (geom_bar). I assume this is because, with the number of bins that I specified, some bins have zero counts. This is correct.

To give others something they can actually run, I counted the number of times each n appeared (effectively making the histogram by hand). Here is that data:

n <- c(2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30, 40, 50, 60, 
70, 80, 90, 100, 200, 300, 400, 500, 600, 700, 800, 900, 
1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 
10000, 20000, 30000, 40000, 50000, 60000, 70000, 80000, 
90000, 100000, 200000)
counts <- c(885452, 468462, 222097, 166234, 103348, 85845, 
60798, 52651, 231830, 81138, 41333, 25274, 17192, 12465, 
9622, 7371, 6069, 27160, 9009, 4465, 2753, 1664, 1285, 918, 
716, 568, 2400, 707, 362, 180, 106, 90, 55, 55, 39, 124, 
25, 12, 8, 2, 1, 0, 2, 0, 3, 2)

These were constructed using half-open [a, b) bins; e.g., the count listed for n = 30 is the number of observations whose n falls in [30, 40), i.e., n = 30, 31, 32, ..., 39.
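Because these counts are already binned, one possible way to draw them is to plot the counts directly with geom_col() rather than re-binning with geom_histogram(). The sketch below is only an illustration under assumptions: it uses a short excerpt of the vectors above, the names binned and p_binned are hypothetical, and it drops empty bins because log10(0) is -Inf.

```r
library(ggplot2)
library(scales)

# Excerpt of the hand-binned data above (n is the left edge of each bin)
binned <- data.frame(
  n      = c(2, 3, 4, 5, 10, 20, 30, 100, 1000),
  counts = c(885452, 468462, 222097, 166234, 231830, 81138, 41333, 27160, 2400)
)

# Drop empty bins; this is a no-op for the excerpt but matters for the
# full vectors, which contain zero counts
binned <- binned[binned$counts > 0, ]

# Bars are drawn at the pre-computed counts, on log-log axes
p_binned <- ggplot(binned, aes(x = n, y = counts)) +
  geom_col(color = "white", fill = "firebrick3") +
  scale_x_continuous(trans = "log10",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x))) +
  scale_y_continuous(trans = "log10",
    breaks = trans_breaks("log10", function(x) 10^x),
    labels = trans_format("log10", math_format(10^.x)))
p_binned
```

Each bar sits at the left edge of its bin on the log axis, so this only approximates the "bar centered above a tick mark" requirement.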

Requirements for the final histogram:

  • The plot should appear on a log-log scale
  • Only the major ticks should be labeled (both axes); i.e., 10^0, 10^1, 10^2, ...
  • Minor tick marks should appear on both axes non-linearly (like any log-log plot)
  • The white lines on the grey background should correspond with the tick marks and not be a linear x-y grid
  • Each bar in the histogram should be centered above a tick mark (this is why I specified bins = 47 in my original try; 30 bins is clearly not appropriate for this)

I think I've missed something fundamental - any ideas?

Updating things with the most recent suggestions, the code is:

g <- ggplot(bigram2, aes(x = n))
g <- g + geom_histogram(color = "white", fill = "firebrick3", bins = 47)
g <- g + scale_x_continuous(trans = 'log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format('log10', math_format(10^.x)))
g <- g + scale_y_continuous(trans = 'log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format('log10', math_format(10^.x)))
g <- g + annotation_logticks()
g <- g + labs(x = "n", y = "log(Count)")
g

and the resulting plot looks like this: [plot image not included]


Solution

  • OP, you're on the right track here. Ultimately, the issue comes down to a typo :/. I'll explain the 3 messages you received when trying your original code, then show you an example with dummy data that should be applicable to your dataset.

    Your error messages.

    OP references three messages received when running the code. Let's explain them (out of sequence):

    • Removed 2 rows containing missing values (geom_bar). This is a warning, not an error. It is not relevant here: it just lets you know that a few bins have no value, so there is nothing to draw for them. You can safely ignore it.

    • Transformation introduced infinite values in continuous y-axis. This is also a warning and can be safely ignored. Infinite values on the y-axis are expected when you log-transform counts and some bins have a count of 0, because log10(0) evaluates to -Inf. The plot can still be drawn; the zero-count bins are most likely the ones being "removed". In your case, OP, you probably have a histogram with two of the bins in the sequence removed because they contain nothing. No worries here.

    • Error in x * scale : non-numeric argument to binary operator. This one pops up because you effectively have a typo in your call to trans_format() inside the scale_*_continuous() functions. The function expects a trans= argument first (much like trans_breaks()), but you only supply the format via math_format(). The result of math_format() is therefore taken as the trans= argument of trans_format(), and you get that error.
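The two warnings can be illustrated in base R. This is a small sketch using the last eight counts from the hand-binned data in the question; the variable names counts_tail and log_counts are made up for illustration.

```r
# Last eight counts from the hand-binned data in the question
counts_tail <- c(8, 2, 1, 0, 2, 0, 3, 2)

# A log10 transformation maps empty bins to -Inf
log_counts <- log10(counts_tail)
log_counts[4]                 # -Inf, because this bin is empty

# Number of bins in this excerpt that cannot be drawn on a log axis
sum(is.infinite(log_counts))
```

Bins like these have no finite height on a log axis, which is why they are dropped from the plot.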

    Fixing the error message

    The fix is pretty simple: specify "log10" as the first argument to trans_format(). In other words, use this: scale_*_continuous(..., labels = trans_format("log10", math_format(10^.x))), and not this: scale_*_continuous(..., labels = trans_format(math_format(10^.x))).
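The difference between the two calls can be checked outside of ggplot by calling the labelling functions directly. This is a minimal sketch, assuming the scales package is installed; good_labels, bad_labels and msg are illustrative names, and the exact error text may vary between scales versions.

```r
library(scales)

# Correct: the transformation comes first, then the formatter
good_labels <- trans_format("log10", math_format(10^.x))
good_labels(c(1, 10, 100))   # an expression vector, one label per break

# Typo from the question: math_format() lands in the trans= slot, so the
# default formatter receives expressions instead of numbers
bad_labels <- trans_format(math_format(10^.x))
msg <- tryCatch(
  {
    bad_labels(c(1, 10, 100))
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg   # expected to hold an error message like the one in the question
```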

    I'll show this via a dummy dataset:

    library(ggplot2)
    library(scales)  # trans_breaks(), trans_format(), math_format()
    set.seed(1234)
    d <- data.frame(n=sample(1:10000, size=1000000, replace=TRUE))
    

    Here's a histogram without the log transformations:

    p <- ggplot(d, aes(x=n)) + geom_histogram(bins=30, color='black', fill='steelblue')
    p
    


    And the log-log transformation:

    p +
      scale_x_continuous(
        trans='log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format('log10', math_format(10^.x))) +
      scale_y_continuous(
        trans='log10',
        breaks = trans_breaks('log10', function(x) 10^x), 
        labels = trans_format('log10', math_format(10^.x))
        )
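To see why only the powers of ten end up labeled, the break function used in both scales can be called on its own. This is a small sketch, assuming the scales package; break_fun is an illustrative name and the range c(1, 10000) mirrors the dummy data.

```r
library(scales)

# trans_breaks() picks "pretty" breaks on the log10 scale and maps them
# back to the data scale with the inverse function
break_fun <- trans_breaks("log10", function(x) 10^x)
break_fun(c(1, 10000))   # 1, 10, 100, 1000, 10000
```

These are the major breaks only; the minor log ticks in the question's code come from annotation_logticks(), which is added as a separate layer.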
    
