I have a data frame like this:
distance exclude
1.1 F
1.5 F
3 F
2 F
1 F
5 T
3 F
63 F
32 F
21 F
15 F
1 T
I want get the four boxplot stats of each segment of data in distance column separated by "T" in exclude column, here "T" serves as separator.
Can anyone help? Thanks so much!
First, let's create some fake data:
library(dplyr)
# Fake data
set.seed(49349)
dat = data.frame(distance=rnorm(500, 50, 10),
exclude=sample(c("T","F"), 500, replace=TRUE, prob=c(0.03,0.95)))
Now create a new group each time exclude
= "T". Then, for each group, and calculate whatever statistics you wish and return the results in a data frame:
box.stats = dat %>%
mutate(group = cumsum(exclude=="T")) %>%
group_by(group) %>%
do(data.frame(n=length(.$distance),
out_90 = sum(.$distance > quantile(.$distance, 0.9)),
out_10 = sum(.$distance < quantile(.$distance, 0.1)),
MEAN = round(mean(.$distance),2),
SD = round(sd(.$distance),2),
out_2SD_high = sum(.$distance > mean(.$distance) + 2*sd(.$distance)),
round(t(quantile(.$distance, probs=c(0,0.1,0.25,0.5,0.75,0.9,1))),2)))
names(box.stats) = gsub("X(.*)\\.$", "p\\1", names(box.stats))
box.stats
group n out_90 out_10 MEAN SD out_2SD_high p0 p10 p25 p50 p75 p90 p100 1 0 15 2 2 46.21 8.78 0 28.66 36.03 41.88 46.04 52.33 56.30 61.98 2 1 36 4 4 50.03 10.01 0 21.71 38.78 44.63 51.13 56.66 61.58 67.84 3 2 80 8 8 50.36 9.00 1 20.30 38.10 45.95 51.28 56.51 61.74 70.44 4 3 9 1 1 55.62 8.58 0 42.11 47.10 49.19 54.54 63.63 65.84 67.88 5 4 16 2 2 47.70 7.79 0 29.03 39.89 43.60 49.26 52.92 56.97 58.02 6 5 66 7 7 49.86 9.93 2 24.84 36.00 45.05 50.51 55.65 61.41 75.27 7 6 44 5 5 50.35 10.39 1 31.72 36.36 43.49 50.95 55.78 64.88 73.64 8 7 80 8 8 49.18 9.24 1 27.62 37.86 42.06 50.34 56.60 59.66 72.13 9 8 31 3 3 52.56 11.18 0 25.78 39.94 44.10 51.32 62.02 66.35 70.40 10 9 60 6 6 50.31 9.82 1 25.43 37.44 44.53 50.31 56.78 62.36 71.77 11 10 33 4 4 49.99 9.78 2 32.74 38.72 42.56 49.60 55.75 62.86 72.20 12 11 30 3 3 48.26 11.47 1 30.03 37.68 40.24 45.65 55.42 60.18 79.36