
Transforming variable density on log scale with R


I want to plot the density of a variable whose summary is the following:

 Min.   :-1214813.0  
 1st Qu.:       1.0  
 Median :      40.0  
 Mean   :     303.2  
 3rd Qu.:     166.0  
 Max.   : 1623990.0

The linear plot of the density results in a tall column in the range [0, 1000], with two very long tails towards positive and negative infinity. Hence, I'd like to transform the variable to a log scale, so that I can see what's going on around the mean. For example, I'm thinking of something like:

log_values = c( -log10(-values[values<0]), log10(values[values>0]))

which results in:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.085   0.699   1.708   1.286   2.272   6.211 

The main problem with this is that it doesn't include the 0 values. Of course, I can shift all the values away from 0 with values[values>=0]+1, but this would introduce some distortion in the data.
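
To make the issue concrete, here is a minimal sketch on made-up values (not the real data). The first transform is the one shown above. The second is one symmetric reading of the "+1" shift, applied to the magnitude on both sides; that symmetric form is an assumption for illustration, not something taken from the question.

```r
# Made-up values for illustration only; not the asker's data.
values <- c(-1000, -10, 0, 0, 1, 10, 1000)

# Transform from the question: zeros satisfy neither condition, so they are dropped.
log_values <- c(-log10(-values[values < 0]), log10(values[values > 0]))
length(values)      # 7 observations go in
length(log_values)  # only 5 come out

# Shifted variant (assumed symmetric form): add 1 to the magnitude before
# taking the log, then restore the sign. Zero maps to zero and nothing is lost.
shifted <- sign(values) * log10(abs(values) + 1)
length(shifted)     # 7
```

The shift matters most for values of small magnitude (1 becomes log10(2) instead of 0) and is negligible in the tails, which is the distortion mentioned above.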

What would be an accepted and scientifically solid way of transforming this variable to the log scale?


Solution

  • Apart from transforming, you can manipulate the histogram itself to get an idea about your data. This has the advantage that the plot itself stays readable and you get an immediate idea about the distribution in the center. Say we simulate the following data:

    Data <- c(rnorm(1000,5,10),sample(-10000:10000,10))
    summary(Data)
    #      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    # -9669.000    -2.119     5.332    85.430    12.460  9870.000 
    

    Then you have a few different approaches. The easiest way to see what is going on in the center of your data is to plot just the center. In this case, say I'm interested in a window that comfortably covers the first to the third quartile; I can plot:

    hist(Data,
         xlim=c(-30,30),   # show only the center
         breaks=c(min(Data),seq(-30,30,by=5),max(Data)),   # one wide bin per tail
         main="Center of Data"
         )
    

    [Figure: histogram titled "Center of Data", x axis restricted to -30..30]

    If you want to count the tails as well, you can transform your data to collapse the tails and alter the axis to reflect this, as follows:

    1. you assign all values outside the range of interest a value that's just outside that range
    2. you plot the histogram, binning all extreme values in one bin
    3. you construct the X axis with the correct labels
    4. you use axis.break() from the package plotrix to add some breaks on your X axis, indicating the discontinuous axis

    For that you can use something like the following code:

     library(plotrix)   # provides axis.break()
     # rearrange data
     plotdata <- Data
     id <- plotdata < -30 | plotdata > 30
     plotdata[id] <- sign(plotdata[id])*35
     # plot histogram
     hist(plotdata,
          xlim=c(-40,40),
          breaks=c(-40,seq(-30,30,by=5),40),
          main="Untailed Data",
          xaxt='n'   # leave the X axis away
          )
     # Construct the X axis
     axis(1,
          at=c(-40,seq(-30,30,by=10),40),
          labels=c(min(Data),seq(-30,30,by=10),max(Data))
     )
     # add axis breaks
     axis.break(axis=1,breakpos=-35)
     axis.break(axis=1,breakpos=35)
    

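    This gives you:

    [Figure: histogram titled "Untailed Data", with axis breaks at -35 and 35 and the outer tick labels showing min(Data) and max(Data)]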

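    Note that because the bins have unequal widths, hist() draws densities by default. You get raw frequencies by adding freq=TRUE to the hist() call, although R then warns that the bar areas are no longer comparable across bins of different width.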
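
    The difference between counts and densities can be inspected without drawing anything. The following is a small sketch that reuses the simulated data and the breaks from above; the seed is an arbitrary addition so the run is reproducible, and the variable name h is only illustrative.

```r
set.seed(1)  # arbitrary seed, added only to make this sketch reproducible
Data <- c(rnorm(1000, 5, 10), sample(-10000:10000, 10))

# Collapse the tails as in the code above
plotdata <- Data
id <- plotdata < -30 | plotdata > 30
plotdata[id] <- sign(plotdata[id]) * 35

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(plotdata,
          breaks = c(-40, seq(-30, 30, by = 5), 40),
          plot = FALSE)

h$counts       # raw number of observations per bin
h$density      # counts / (n * bin width), which is drawn by default for unequal bins
sum(h$counts)  # equals length(Data): the collapsed tails are still counted
```

    The two outer bins are 10 units wide while the inner ones are 5 units wide, so their bars look half as tall on the density scale as the same count would in an inner bin.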