I am currently trying to learn r with the help of Hadley Wickham's great resources ("r for data scientists", "ggplot2 Elegant Graphics for Data Analysis"). So far I was able to find answers to all my problems there (thank you so much, Hadley!), but not this time.
Currently, I am working with data from an instrument that estimates particle size by the light the particles scatter (DLS, Zetasizer Nano, Malvern Instruments). The data extracted from this device are some summary statistics (e.g. mean particle size) and histogram data: x = size (split in bins), y = intensity [%]. Here is a tibble of one of my measurements:
# A tibble: 70 x 3
sample_name intensities bins
<chr> <dbl> <dbl>
1 core formulation 1 0 0.4
2 core formulation 1 0 0.463
3 core formulation 1 0 0.536
4 core formulation 1 0 0.621
5 core formulation 1 0 0.720
6 core formulation 1 0 0.833
7 core formulation 1 0 0.965
8 core formulation 1 0 1.12
9 core formulation 1 0 1.29
10 core formulation 1 0 1.50
11 core formulation 1 0 1.74
12 core formulation 1 0 2.01
13 core formulation 1 0 2.33
14 core formulation 1 0 2.70
15 core formulation 1 0 3.12
16 core formulation 1 0 3.62
17 core formulation 1 0 4.19
18 core formulation 1 0 4.85
19 core formulation 1 0 5.62
20 core formulation 1 0 6.50
21 core formulation 1 0 7.53
22 core formulation 1 0 8.72
23 core formulation 1 0 10.1
24 core formulation 1 0 11.7
25 core formulation 1 0 13.5
26 core formulation 1 0 15.7
27 core formulation 1 0 18.2
28 core formulation 1 0 21.0
29 core formulation 1 0 24.4
30 core formulation 1 0 28.2
31 core formulation 1 0 32.7
32 core formulation 1 0 37.8
33 core formulation 1 0 43.8
34 core formulation 1 0.2 50.8
35 core formulation 1 1.4 58.8
36 core formulation 1 3.7 68.1
37 core formulation 1 6.9 78.8
38 core formulation 1 10.2 91.3
39 core formulation 1 12.9 106.
40 core formulation 1 14.4 122.
41 core formulation 1 14.4 142.
42 core formulation 1 13 164.
43 core formulation 1 10.3 190.
44 core formulation 1 7.1 220.
45 core formulation 1 3.9 255
46 core formulation 1 1.5 295.
47 core formulation 1 0.2 342
48 core formulation 1 0 396.
49 core formulation 1 0 459.
50 core formulation 1 0 531.
51 core formulation 1 0 615.
52 core formulation 1 0 712.
53 core formulation 1 0 825
54 core formulation 1 0 955.
55 core formulation 1 0 1106
56 core formulation 1 0 1281
57 core formulation 1 0 1484
58 core formulation 1 0 1718
59 core formulation 1 0 1990
60 core formulation 1 0 2305
61 core formulation 1 0 2669
62 core formulation 1 0 3091
63 core formulation 1 0 3580
64 core formulation 1 0 4145
65 core formulation 1 0 4801
66 core formulation 1 0 5560
67 core formulation 1 0 6439
68 core formulation 1 0 7456
69 core formulation 1 0 8635
70 core formulation 1 0 10000
Here is the data produced with the dput()
command:
structure(list(sample_name = c("core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1", "core formulation 1",
"core formulation 1", "core formulation 1"), intensities = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 1.4, 3.7, 6.9, 10.2, 12.9,
14.4, 14.4, 13, 10.3, 7.1, 3.9, 1.5, 0.2, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), bins = c(0.4,
0.4632, 0.5365, 0.6213, 0.7195, 0.8332, 0.9649, 1.117, 1.294,
1.499, 1.736, 2.01, 2.328, 2.696, 3.122, 3.615, 4.187, 4.849,
5.615, 6.503, 7.531, 8.721, 10.1, 11.7, 13.54, 15.69, 18.17,
21.04, 24.36, 28.21, 32.67, 37.84, 43.82, 50.75, 58.77, 68.06,
78.82, 91.28, 105.7, 122.4, 141.8, 164.2, 190.1, 220.2, 255,
295.3, 342, 396.1, 458.7, 531.2, 615.1, 712.4, 825, 955.4, 1106,
1281, 1484, 1718, 1990, 2305, 2669, 3091, 3580, 4145, 4801, 5560,
6439, 7456, 8635, 10000)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -70L))
I can produce a histogram with no problems from this data:
library(tidyverse)
ggplot (DLS_intensities_core, aes(bins,intensities) ) +
geom_line() +
scale_x_continuous(trans = 'log10')
In order to show the overall distribution of my particle size, I would like to convert this data into a violin plot and use the summary statistics provided by the device in a second layer of my plot.
Therefore, I would like to transform this data to be able to create a violin plot from it.
I have already tried feeding it to the stat_density () argument of the violin plot but so far with no success.
Do you know how to create a violin plot from this data?
Thank you very much!
Best,
Dominik
I found a solution to my problem, it is probably not very elegant:
library (tidyverse)
DLS_intensities_core <- DLS_intensities_core %>%
mutate(counts = intensities * 10 )
vectors <- DLS_intensities_core %>%
filter(counts > 0)
bins_v <- vectors$bins
count_v <- vectors$counts
violin_DLSdata <- as.tibble(rep.int(bins_v, count_v))
violin_DLSdata$sample_name <- "core formulation 1"
ggplot (violin_DLSdata, aes(sample_name, value)) +
geom_violin() +
labs(
x = NULL, y = "size"
) +
scale_y_continuous(trans = 'log10', limits = c(1, 1000))
for my whole dataset it looks like this:
I have added: summary statistic as red dots with errorbars.
What do you think?