Transform histogram to violin plot in R with ggplot


I am currently trying to learn R with the help of Hadley Wickham's great resources ("R for Data Science", "ggplot2: Elegant Graphics for Data Analysis"). So far I have been able to find answers to all my problems there (thank you so much, Hadley!), but not this time.

Currently, I am working with data from an instrument that estimates particle size from the light the particles scatter (DLS, Zetasizer Nano, Malvern Instruments). The data exported from this device consist of some summary statistics (e.g. mean particle size) and histogram data: x = size (split into bins), y = intensity [%]. Here is a tibble of one of my measurements:

   # A tibble: 70 x 3
   sample_name        intensities      bins
   <chr>                    <dbl>     <dbl>
 1 core formulation 1         0       0.4  
 2 core formulation 1         0       0.463
 3 core formulation 1         0       0.536
 4 core formulation 1         0       0.621
 5 core formulation 1         0       0.720
 6 core formulation 1         0       0.833
 7 core formulation 1         0       0.965
 8 core formulation 1         0       1.12 
 9 core formulation 1         0       1.29 
10 core formulation 1         0       1.50 
11 core formulation 1         0       1.74 
12 core formulation 1         0       2.01 
13 core formulation 1         0       2.33 
14 core formulation 1         0       2.70 
15 core formulation 1         0       3.12 
16 core formulation 1         0       3.62 
17 core formulation 1         0       4.19 
18 core formulation 1         0       4.85 
19 core formulation 1         0       5.62 
20 core formulation 1         0       6.50 
21 core formulation 1         0       7.53 
22 core formulation 1         0       8.72 
23 core formulation 1         0      10.1  
24 core formulation 1         0      11.7  
25 core formulation 1         0      13.5  
26 core formulation 1         0      15.7  
27 core formulation 1         0      18.2  
28 core formulation 1         0      21.0  
29 core formulation 1         0      24.4  
30 core formulation 1         0      28.2  
31 core formulation 1         0      32.7  
32 core formulation 1         0      37.8  
33 core formulation 1         0      43.8  
34 core formulation 1         0.2    50.8  
35 core formulation 1         1.4    58.8  
36 core formulation 1         3.7    68.1  
37 core formulation 1         6.9    78.8  
38 core formulation 1        10.2    91.3  
39 core formulation 1        12.9   106.   
40 core formulation 1        14.4   122.   
41 core formulation 1        14.4   142.   
42 core formulation 1        13     164.   
43 core formulation 1        10.3   190.   
44 core formulation 1         7.1   220.   
45 core formulation 1         3.9   255    
46 core formulation 1         1.5   295.   
47 core formulation 1         0.2   342    
48 core formulation 1         0     396.   
49 core formulation 1         0     459.   
50 core formulation 1         0     531.   
51 core formulation 1         0     615.   
52 core formulation 1         0     712.   
53 core formulation 1         0     825    
54 core formulation 1         0     955.   
55 core formulation 1         0    1106    
56 core formulation 1         0    1281    
57 core formulation 1         0    1484    
58 core formulation 1         0    1718    
59 core formulation 1         0    1990    
60 core formulation 1         0    2305    
61 core formulation 1         0    2669    
62 core formulation 1         0    3091    
63 core formulation 1         0    3580    
64 core formulation 1         0    4145    
65 core formulation 1         0    4801    
66 core formulation 1         0    5560    
67 core formulation 1         0    6439    
68 core formulation 1         0    7456    
69 core formulation 1         0    8635    
70 core formulation 1         0   10000    

Here is the data produced with the dput() command:

structure(list(sample_name = c("core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1", "core formulation 1", 
"core formulation 1", "core formulation 1"), intensities = c(0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.2, 1.4, 3.7, 6.9, 10.2, 12.9, 
14.4, 14.4, 13, 10.3, 7.1, 3.9, 1.5, 0.2, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), bins = c(0.4, 
0.4632, 0.5365, 0.6213, 0.7195, 0.8332, 0.9649, 1.117, 1.294, 
1.499, 1.736, 2.01, 2.328, 2.696, 3.122, 3.615, 4.187, 4.849, 
5.615, 6.503, 7.531, 8.721, 10.1, 11.7, 13.54, 15.69, 18.17, 
21.04, 24.36, 28.21, 32.67, 37.84, 43.82, 50.75, 58.77, 68.06, 
78.82, 91.28, 105.7, 122.4, 141.8, 164.2, 190.1, 220.2, 255, 
295.3, 342, 396.1, 458.7, 531.2, 615.1, 712.4, 825, 955.4, 1106, 
1281, 1484, 1718, 1990, 2305, 2669, 3091, 3580, 4145, 4801, 5560, 
6439, 7456, 8635, 10000)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -70L))

I can plot this histogram data as a line without any problems:

library(tidyverse)
ggplot(DLS_intensities_core, aes(bins, intensities)) +
  geom_line() +
  scale_x_continuous(trans = "log10")

[Figure: line plot, intensities against bins]

To show the overall distribution of my particle size, I would like to transform this data so that I can draw a violin plot from it, and then add the summary statistics provided by the device as a second layer of the plot.

I have already tried feeding it to stat_density() for the violin plot, but so far without success.

Do you know how to create a violin plot from this data?

Thank you very much!

Best,

Dominik


Solution

  • I found a solution to my problem, although it is probably not very elegant:

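    library(tidyverse)

    # Scale the intensities [%] to whole-number pseudo-counts (0.1 % steps).
    # round() guards against floating-point results just below a whole number,
    # which rep.int() would otherwise truncate.
    DLS_intensities_core <- DLS_intensities_core %>%
      mutate(counts = round(intensities * 10))

    # Keep only the bins that carry any signal
    vectors <- DLS_intensities_core %>%
      filter(counts > 0)

    bins_v <- vectors$bins
    count_v <- vectors$counts

    # Repeat each bin size `counts` times: one row per pseudo-observation.
    # tibble() replaces the deprecated as.tibble() call on a bare vector.
    violin_DLSdata <- tibble(
      sample_name = "core formulation 1",
      value = rep.int(bins_v, count_v)
    )

    ggplot(violin_DLSdata, aes(sample_name, value)) +
      geom_violin() +
      labs(x = NULL, y = "size") +
      scale_y_continuous(trans = "log10", limits = c(1, 1000))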
    

    [Figure: violin_plot 1]

    For my whole dataset it looks like this:

    [Figure: violin_plot with summary statistics]

    I have added the summary statistics as red dots with error bars.
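    The summary-statistics layer is not shown in the code, so here is a minimal sketch of how such an overlay could be drawn. The device's own statistics are not part of the post, so `summary_stats`, `mean_size` and `sd_size` are hypothetical names, and the values are simply computed from the expanded data as a stand-in. Only the non-zero bins from the dput() output are used.

```r
library(ggplot2)

# Non-zero bins and intensities [%], copied from the dput() output above
dls <- data.frame(
  bins = c(50.75, 58.77, 68.06, 78.82, 91.28, 105.7, 122.4, 141.8,
           164.2, 190.1, 220.2, 255, 295.3, 342),
  intensities = c(0.2, 1.4, 3.7, 6.9, 10.2, 12.9, 14.4, 14.4,
                  13, 10.3, 7.1, 3.9, 1.5, 0.2)
)

# Expand the histogram into pseudo-observations (same idea as the code above)
violin_DLSdata <- data.frame(
  sample_name = "core formulation 1",
  value = rep.int(dls$bins, as.integer(round(dls$intensities * 10)))
)

# Hypothetical stand-in for the instrument's summary statistics:
# here just the mean and standard deviation of the expanded data
summary_stats <- data.frame(
  sample_name = "core formulation 1",
  mean_size = mean(violin_DLSdata$value),
  sd_size = sd(violin_DLSdata$value)
)

p_summary <- ggplot(violin_DLSdata, aes(sample_name, value)) +
  geom_violin() +
  # Second layer: its own data frame, red point with an error bar
  geom_pointrange(
    data = summary_stats,
    aes(y = mean_size,
        ymin = mean_size - sd_size,
        ymax = mean_size + sd_size),
    colour = "red"
  ) +
  labs(x = NULL, y = "size") +
  scale_y_continuous(trans = "log10", limits = c(1, 1000))

p_summary
```

    With real data, `summary_stats` would be replaced by a table of the values exported from the instrument, one row per sample. On a log10 axis the lower end of the error bar has to stay positive.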

    What do you think?
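    A possible alternative that avoids expanding the data, offered as a sketch rather than a tested recommendation: the geom_violin() documentation lists a `weight` aesthetic, so the intensities can be passed as weights for the density estimate. This assumes the aesthetic behaves as documented in your ggplot2 version. The bandwidth is then chosen from the 14 bin positions rather than from the expanded sample, so the shape may differ somewhat from the plot above. The data frame `dls` again holds only the non-zero bins from the dput() output.

```r
library(ggplot2)

# Non-zero bins and intensities [%], copied from the dput() output above
dls <- data.frame(
  sample_name = "core formulation 1",
  bins = c(50.75, 58.77, 68.06, 78.82, 91.28, 105.7, 122.4, 141.8,
           164.2, 190.1, 220.2, 255, 295.3, 342),
  intensities = c(0.2, 1.4, 3.7, 6.9, 10.2, 12.9, 14.4, 14.4,
                  13, 10.3, 7.1, 3.9, 1.5, 0.2)
)

# Each bin is one observation, weighted by its intensity.
# The log10 transformation is applied before the density is estimated.
p_weighted <- ggplot(dls, aes(sample_name, bins, weight = intensities)) +
  geom_violin() +
  labs(x = NULL, y = "size") +
  scale_y_continuous(trans = "log10", limits = c(1, 1000))

p_weighted
```

    Dropping the zero-intensity bins beforehand keeps the violin from being stretched over the full 0.4 to 10000 range of the instrument.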