(Not sure whether this question belongs to CrossValidated or Stackoverflow)
Subset of my data:
mdat1 <- structure(list(Name = structure(c(3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Bilbao",
"San Sebastian", "Vitoria"), class = "factor"), PrecipTotal = c(0,
1.01600203200406, 0, 6.09601219202438, 73.4061468122936, 4.31800863601727,
0, 0.254000508001016, 7.8740157480315, 5.58801117602235, 0, 0,
0, 0, 2.03200406400813, 0, 0.254000508001016, 0, 2.03200406400813,
0, 0, 0, 57.9121158242316, 1.77800355600711, 0, 0.762001524003048,
6.3500127000254, 0, 0, 1.27000254000508, 8.89001778003556, 1.01600203200406,
0, 0, 0, 0, 0.762001524003048, 0, 8.89001778003556, 0, 0, 21.8440436880874,
0, 0.508001016002032, 0, 0.508001016002032, 0.508001016002032,
0, 0, 0, 14.4780289560579, 0.254000508001016, 0.508001016002032,
0, 23.3680467360935, 6.09601219202438, 0, 0, 0, 0, 28.1940563881128,
0, 0, 0, 3.04800609601219, 0, 0, 0, 0, 6.09601219202438, 0, 2.03200406400813,
0, 4.06400812801626, 0, 0.508001016002032, 0, 0, 0.508001016002032,
7.11201422402845, 34.0360680721361, 0, 0, 0, 7.8740157480315,
0, 4.06400812801626, 0, 0, 0.508001016002032, 5.08001016002032,
7.11201422402845, 7.11201422402845, 0, 0, 0, 1.01600203200406,
0, 0, 0), Hail = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("Hail",
"NoHail"), class = "factor")), .Names = c("Name", "PrecipTotal",
"Hail"), row.names = c(43878L, 33821L, 40681L, 35121L, 45112L,
46428L, 45844L, 43199L, 34440L, 43184L, 32850L, 39220L, 38416L,
33860L, 34867L, 32737L, 43232L, 31772L, 35850L, 38894L, 39289L,
33148L, 32159L, 43197L, 43962L, 45068L, 41848L, 35929L, 34842L,
42069L, 39503L, 31747L, 43286L, 34919L, 43925L, 45368L, 42489L,
41686L, 43194L, 34747L, 37001L, 42923L, 45006L, 46170L, 33191L,
34392L, 44047L, 35859L, 42159L, 38843L, 45860L, 34180L, 33846L,
42810L, 46160L, 33523L, 34840L, 40226L, 42868L, 43576L, 46570L,
39980L, 42453L, 42063L, 38121L, 32822L, 40670L, 32859L, 46228L,
40239L, 32420L, 38874L, 39638L, 39523L, 31765L, 32753L, 33752L,
35574L, 36263L, 32871L, 32539L, 38455L, 41119L, 45124L, 34560L,
34144L, 41461L, 41449L, 35499L, 42783L, 34106L, 38151L, 36313L,
46593L, 39973L, 43928L, 35240L, 43626L, 46195L, 44388L), class = "data.frame")
Using the following code
cdplot(mdat1 [, 2], mdat1 [, 3], ylab = "", main = "1",
xlab = "",
col = c("purple", "gray"))
creates a messed up output ("1") of cdplot()
. Using a different sample of my original data produces the output labeled with "2"
I assume that it has something to do with the distribution of the x-values? If they are extremely skewed (like for "1"), the density calculation gets in trouble?
I'd say it was simply a bug, albeit one that you are warned rather vaguely about when the help page says "conditional densities are more reliable for high-density regions of x". Contrast all these efforts with the result you get with lattice's densityplot
. (Much cleaner and informative in my opinion.) The cdplot
and ggplot
efforts appear to seriously distort the data.
library(lattice)
densityplot(~PrecipTotal, groups=Hail, mdat1, col = c("purple", "gray"))
You can contrast that disply of the data with the output from that less pathological appearance you get from:
cdplot(Hail ~ PrecipTotal, data=mdat1, bw=2)
... but that still leaves you with the impression that there is substantial difference in the densities of the two groups in the region of 45-65 whereas the side-by-side display should you htat there is a gap in one and a single point in the other group which seems far more easily explained by random variation.
There is the fine point to be made that the lattice plotting argument convention is that separate plots result from a formula specification that includes the grouping variable, whereas bringing the grouping in with the groups=
mechanism includes them in the same plot region.