Search code examples
rdensity-plot

Problems with density calculation of cdplot() in R


(Not sure whether this question belongs to CrossValidated or Stackoverflow)

Subset of my data:

mdat1 <- structure(list(Name = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Bilbao", 
"San Sebastian", "Vitoria"), class = "factor"), PrecipTotal = c(0, 
1.01600203200406, 0, 6.09601219202438, 73.4061468122936, 4.31800863601727, 
0, 0.254000508001016, 7.8740157480315, 5.58801117602235, 0, 0, 
0, 0, 2.03200406400813, 0, 0.254000508001016, 0, 2.03200406400813, 
0, 0, 0, 57.9121158242316, 1.77800355600711, 0, 0.762001524003048, 
6.3500127000254, 0, 0, 1.27000254000508, 8.89001778003556, 1.01600203200406, 
0, 0, 0, 0, 0.762001524003048, 0, 8.89001778003556, 0, 0, 21.8440436880874, 
0, 0.508001016002032, 0, 0.508001016002032, 0.508001016002032, 
0, 0, 0, 14.4780289560579, 0.254000508001016, 0.508001016002032, 
0, 23.3680467360935, 6.09601219202438, 0, 0, 0, 0, 28.1940563881128, 
0, 0, 0, 3.04800609601219, 0, 0, 0, 0, 6.09601219202438, 0, 2.03200406400813, 
0, 4.06400812801626, 0, 0.508001016002032, 0, 0, 0.508001016002032, 
7.11201422402845, 34.0360680721361, 0, 0, 0, 7.8740157480315, 
0, 4.06400812801626, 0, 0, 0.508001016002032, 5.08001016002032, 
7.11201422402845, 7.11201422402845, 0, 0, 0, 1.01600203200406, 
0, 0, 0), Hail = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hail", 
"NoHail"), class = "factor")), .Names = c("Name", "PrecipTotal", 
"Hail"), row.names = c(43878L, 33821L, 40681L, 35121L, 45112L, 
46428L, 45844L, 43199L, 34440L, 43184L, 32850L, 39220L, 38416L, 
33860L, 34867L, 32737L, 43232L, 31772L, 35850L, 38894L, 39289L, 
33148L, 32159L, 43197L, 43962L, 45068L, 41848L, 35929L, 34842L, 
42069L, 39503L, 31747L, 43286L, 34919L, 43925L, 45368L, 42489L, 
41686L, 43194L, 34747L, 37001L, 42923L, 45006L, 46170L, 33191L, 
34392L, 44047L, 35859L, 42159L, 38843L, 45860L, 34180L, 33846L, 
42810L, 46160L, 33523L, 34840L, 40226L, 42868L, 43576L, 46570L, 
39980L, 42453L, 42063L, 38121L, 32822L, 40670L, 32859L, 46228L, 
40239L, 32420L, 38874L, 39638L, 39523L, 31765L, 32753L, 33752L, 
35574L, 36263L, 32871L, 32539L, 38455L, 41119L, 45124L, 34560L, 
34144L, 41461L, 41449L, 35499L, 42783L, 34106L, 38151L, 36313L, 
46593L, 39973L, 43928L, 35240L, 43626L, 46195L, 44388L), class = "data.frame")

Using the following code

cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", main = "1", 
                 xlab = "", 
                 col = c("purple", "gray"))

creates a messed up output ("1") of cdplot(). Using a different sample of my original data produces the output labeled with "2"

enter image description here

I assume that it has something to do with the distribution of the x-values? If they are extremely skewed (like for "1"), the density calculation gets in trouble?

enter image description here


Solution

  • I'd say it was simply a bug, albeit one that you are warned rather vaguely about when the help page says "conditional densities are more reliable for high-density regions of x". Contrast all these efforts with the result you get with lattice's densityplot. (Much cleaner and informative in my opinion.) The cdplot and ggplot efforts appear to seriously distort the data.

    library(lattice)
    densityplot(~PrecipTotal, groups=Hail, mdat1, col = c("purple", "gray"))
    

    You can contrast that disply of the data with the output from that less pathological appearance you get from:

    cdplot(Hail ~ PrecipTotal, data=mdat1, bw=2)
    

    ... but that still leaves you with the impression that there is substantial difference in the densities of the two groups in the region of 45-65 whereas the side-by-side display should you htat there is a gap in one and a single point in the other group which seems far more easily explained by random variation.

    enter image description here

    There is the fine point to be made that the lattice plotting argument convention is that separate plots result from a formula specification that includes the grouping variable, whereas bringing the grouping in with the groups= mechanism includes them in the same plot region.