I want to plot the density of a variable whose summary is as follows:
Min. :-1214813.0
1st Qu.: 1.0
Median : 40.0
Mean : 303.2
3rd Qu.: 166.0
Max. : 1623990.0
The linear plot of the density results in a tall column in range [0,1000], with two very long tails towards positive infinity and negative infinity. Hence, I'd like to transform the variable to a log scale, so that I can see what's going on around the mean. For example, I'm thinking of something like:
log_values = c( -log10(-values[values<0]), log10(values[values>0]))
which results in:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-6.085 0.699 1.708 1.286 2.272 6.211
The main problem with this is that it doesn't include the 0 values. Of course, I can shift all the values away from 0 with values[values>=0]+1, but this would introduce some distortion in the data.
What would be an accepted and scientifically solid way of transforming this variable to the log scale?
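For illustration, one transformation that keeps the zeros is a signed log, sign(x) * log10(1 + |x|). The sketch below is an assumption, not something taken from the question: the function name signed_log10 and the sample values are hypothetical. It amounts to the +1 shift applied symmetrically to both signs, so it carries the same kind of distortion near zero. Base R's asinh() behaves similarly and is also defined at 0.

```r
# Signed log10: symmetric around 0 and defined at 0.
# Hypothetical helper, not from the original post.
signed_log10 <- function(x) sign(x) * log10(1 + abs(x))

# Hypothetical values spanning the range shown in the summary above
values <- c(-1214813, -3, 0, 1, 40, 166, 1623990)

log_values <- signed_log10(values)
print(round(log_values, 3))

# asinh() from base R is a related alternative that also handles 0
print(round(asinh(values), 3))

# The density can then be drawn on the transformed scale
plot(density(log_values), main = "Density on signed log10 scale")
```

Axis labels on such a plot are in transformed units, so they need to be mapped back before reading off original values.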
Apart from transforming, you can manipulate the histogram itself to get an idea about your data. This has the advantage that the plot itself stays readable and you get an immediate idea about the distribution in the center. Say we simulate the following data:
Data <- c(rnorm(1000,5,10),sample(-10000:10000,10))
> summary(Data)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-9669.000 -2.119 5.332 85.430 12.460 9870.000
Then you have a few different approaches. The easiest way to see what is going on in the center of your data is to plot just that center. In this case, say I'm interested in what happens between the first and the third quartile; I can plot:
hist(Data,
xlim=c(-30,30),
breaks=c(min(Data),seq(-30,30,by=5),max(Data)),
main="Center of Data"
)
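The window of -30 to 30 above is picked by eye. As a rough sketch, the same window could be derived from the quartiles instead. The helper center_breaks, its parameters width and k, and the fixed seed are all assumptions introduced here for illustration, not part of the approach described above.

```r
# Assumption: fixed seed so the sketch is reproducible
set.seed(42)
Data <- c(rnorm(1000, 5, 10), sample(-10000:10000, 10))

# Hypothetical helper: build a central window of k IQRs beyond the
# quartiles, snapped to multiples of `width`, plus one wide bin per tail.
center_breaks <- function(x, width = 5, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  lo  <- floor((q[1] - k * iqr) / width) * width
  hi  <- ceiling((q[2] + k * iqr) / width) * width
  inner <- seq(lo, hi, by = width)
  # Outer breaks must cover the full data range, as hist() requires
  breaks <- sort(unique(c(min(x, lo), inner, max(x, hi))))
  list(xlim = c(lo, hi), breaks = breaks)
}

cb <- center_breaks(Data)
hist(Data,
     xlim   = cb$xlim,
     breaks = cb$breaks,
     main   = "Center of Data (window from quartiles)")
```

The choice of k = 1.5 mirrors the usual boxplot whisker rule; a different multiplier simply widens or narrows the window.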
If you want to count the tails as well, you can transform your data to collapse the tails and alter the axis to reflect this. You can use axis.break() from the package plotrix to add some breaks on your X axis, indicating the discontinuous axis. For that you can use something like the following code:
library(plotrix) # stops with an error if plotrix is not installed
# rearrange data
plotdata <- Data
id <- plotdata < -30 | plotdata > 30
plotdata[id] <- sign(plotdata[id])*35
# plot histogram
hist(plotdata,
xlim=c(-40,40),
breaks=c(-40,seq(-30,30,by=5),40),
main="Untailed Data",
xaxt='n' # leave the X axis away
)
# Construct the X axis
axis(1,
at=c(-40,seq(-30,30,by=10),40),
labels=c(min(Data),seq(-30,30,by=10),max(Data))
)
# add axis breaks
axis.break(axis=1,breakpos=-35)
axis.break(axis=1,breakpos=35)
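This gives you a histogram in which each collapsed tail is drawn as a single outer bar, with break marks on the X axis.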
Note that you get raw frequencies by adding freq=TRUE to the hist() function.
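If only the numbers are needed, the per-bin counts can also be read without drawing anything, since hist() returns them when called with plot=FALSE. The following is a minimal sketch that repeats the tail-collapsing step on freshly simulated data; the fixed seed and the variable names brks and h are assumptions added for illustration.

```r
# Assumption: fixed seed so the sketch is reproducible
set.seed(1)
Data <- c(rnorm(1000, 5, 10), sample(-10000:10000, 10))

# Collapse both tails onto +/-35, as in the code above
plotdata <- Data
id <- plotdata < -30 | plotdata > 30
plotdata[id] <- sign(plotdata[id]) * 35

# Compute the histogram without plotting and inspect raw counts per bin
brks <- c(-40, seq(-30, 30, by = 5), 40)
h <- hist(plotdata, breaks = brks, plot = FALSE)
names(h$counts) <- paste(head(brks, -1), "to", tail(brks, -1))
print(h$counts)

# Number of observations that were collapsed into each tail
print(c(lower = sum(Data < -30), upper = sum(Data > 30)))
```

The first and last entries of h$counts correspond to the two outer bars, which are twice as wide as the inner bins.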