I am making a function Prop.Histogram() that plots data as a histogram showing the proportions, with a normal distribution curve added to it. Adding the curve was difficult for me, but I succeeded (see code below)!
Note: I personally prefer to work with the pipe operator %>% from the package magrittr in my code. However, as probably not everyone is familiar with this operator or this package (or they prefer not to use it), I'll also provide the same code without magrittr below.
Code using magrittr
Prop.Histogram <- function(data,
                           xlim_min, xlim_max, x_BreakSize,
                           ylim_max, y_steps) {
  # Load packages
  library(magrittr)
  # Make histogram of data without y-axis
  hist(data, freq = FALSE, ylab = "Proportion",
       xlim = c(xlim_min, xlim_max), breaks = seq(from = xlim_min, to = xlim_max, by = x_BreakSize),
       ylim = c(0, ylim_max %>% divide_by(., x_BreakSize)), yaxt = "n")
  # I divided ylim_max by x_BreakSize, as I want ylim_max to be equal to the max proportion shown on the y_axis (and not to the max density)
  # Add y-axis that shows proportion and not density
  axis(side = 2,
       at = seq(from = 0, to = ylim_max %>% divide_by(., x_BreakSize), by = y_steps %>% divide_by(., x_BreakSize)),
       labels = seq(from = 0, to = ylim_max, by = y_steps))
  box()
  # Add curve to histogram
  curve(dnorm(x, mean = mean(data), sd = sd(data)), lwd = 5, add = TRUE, yaxt = "n")
}
Same code without using magrittr
Prop.Histogram <- function(data,
                           xlim_min, xlim_max, x_BreakSize,
                           ylim_max, y_steps) {
  # No packages needed: this version uses base R only
  # Make histogram of data without y-axis
  hist(data, freq = FALSE, ylab = "Proportion",
       xlim = c(xlim_min, xlim_max), breaks = seq(from = xlim_min, to = xlim_max, by = x_BreakSize),
       ylim = c(0, ylim_max/x_BreakSize), yaxt = "n")
  # I divided ylim_max by x_BreakSize, as I want ylim_max to be equal to the max proportion shown on the y_axis (and not to the max density)
  # Add y-axis that shows proportion and not density
  axis(side = 2,
       at = seq(from = 0, to = ylim_max/x_BreakSize, by = y_steps/x_BreakSize),
       labels = seq(from = 0, to = ylim_max, by = y_steps))
  box()
  # Add curve to histogram
  curve(dnorm(x, mean = mean(data), sd = sd(data)), lwd = 5, add = TRUE, yaxt = "n")
}
This code does exactly what I want it to do: it plots the proportions and adds a normal distribution curve to the plot. However, I have difficulty understanding why adding the curve actually works.
Main question (1): I have to put x as the first argument in dnorm(), and even though I have not defined x, it works! So my first and main question is: what is x, what does it do, and why does it work in my function?
Second question (2): My second question is whether it is possible (and, if so, how) to use magrittr pipe operators (%>%) in the line of code that adds the curve to the plot. (Even if using these operators is not the best way to do so in this case, I am still interested in the answer as I am eager to learn!)
First of all, for those who want to try out my code: here is some data that is representative of data that I want to plot:
data <- rnorm(724, mean = 84, sd = 33)
Prop.Histogram(data,
               xlim_min = -50, xlim_max = 200, x_BreakSize = 10,
               ylim_max = 0.15, y_steps = 0.05)
Main question (1): role of x in dnorm()/curve()
I started by using data instead of x as the first argument of dnorm(), but this didn't work, as it resulted in the following error message:
Error in curve(dnorm(data, mean = mean(data), sd = sd(data)), lwd = 5, :
'expr' must be a function, or a call or an expression containing 'x'
But then, when I take dnorm(data, mean = mean(data), sd = sd(data)) and run it on its own (not as an argument of curve()), it gives me 724 values (I don't know what they mean, but at least it's not an error message). That is odd, since using data as the first argument when dnorm() is part of curve() in my function results in an error message, as we saw previously.
Then, when I swap data for x and run dnorm(x, mean = mean(data), sd = sd(data)) (again not as an argument of curve()), it gives me another error message:
Error in dnorm(x, mean = mean(data), sd = sd(data)) :
object 'x' not found
This I can understand, as I've not defined x anywhere in my code. But that raises the question: why do I not get this same error message when I run my (working) function?
In short, I observed that x must be the first argument in dnorm() when dnorm() is used as an argument in curve(), but x cannot be used as the first argument when dnorm() is used on its own. Conclusion: I am lost.
Of course, when I am lost in R, I always look at the R help page. The help page of dnorm() states that x is a vector of quantiles... that's it. I know those words individually, but have no idea what they mean in my code (as I've not defined x, so what vector or what quantiles is the R help page talking about?).
Second question (2): use of magrittr in code
I've tried to write the code curve(dnorm(x, mean = mean(data), sd = sd(data)), lwd = 5, add = TRUE, yaxt = "n") using magrittr, but it does not work. Here are some examples I've tried:
data %>% dnorm(x, mean = mean(.), sd = sd(.)) %>% curve(., lwd = 5, add = TRUE, yaxt = "n")
data %>% dnorm(x, mean = mean(.), sd = sd(.)) %>% curve(lwd = 5, add = TRUE, yaxt = "n")
dnorm(x, mean = mean(data), sd = sd(data)) %>% curve(., lwd = 5, add = TRUE, yaxt = "n")
They all result in the same error message:
Error in dnorm(x, mean = mean(data), sd = sd(data)) :
object 'x' not found
I'd like to know if it's possible to use magrittr operators like %>% in this situation (even if it's not the best option).
PS. This is my first time posting, so please feel free to give feedback or ask me for more information if needed. Thank you in advance!
The curve() function uses non-standard evaluation. x is just a placeholder in the expression that it will plot. See ?curve for details.
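To make the placeholder idea concrete, here is a minimal sketch of the same mechanism. The helper eval_at() is hypothetical and written only for illustration (it is not part of base R, and the real curve() does more): it captures its argument unevaluated and then evaluates it with x bound to values it supplies itself, which is why x never has to exist in your workspace.

```r
# Hypothetical helper for illustration only; the real curve() also picks
# the x values itself and draws the result.
eval_at <- function(expr, at) {
  captured <- substitute(expr)  # grab the expression without evaluating it
  # Evaluate the expression with 'x' bound to 'at'; every other name
  # (e.g. 'data') is looked up in the caller's environment.
  eval(captured, envir = list(x = at), enclos = parent.frame())
}

pts <- c(-1, 0, 1)
res <- eval_at(dnorm(x, mean = 0, sd = 1), at = pts)

# Same values as calling dnorm() directly on the points
all.equal(res, dnorm(pts, mean = 0, sd = 1))
```

Calling dnorm(x, ...) on its own skips the capture step, so R tries to look up x immediately and fails with "object 'x' not found".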
In fact, x doesn't need to be the first argument; it can appear anywhere in the expression. But you would usually want it to be attached to the first argument of dnorm, so putting it first works well. If you want to see the effect of the sd argument on the density at 0, you could use:
curve(dnorm(0, sd = x))
When you do put it first, the dummy x that curve() is looking for will be bound to the first argument of dnorm(), which happens to also be named x, as you saw on the help page. It is the location at which you want to calculate the density.
When you called dnorm(data, mean = mean(data), sd = sd(data)) you were asking it to calculate the density of a normal distribution with mean mean(data) and standard deviation sd(data) at each of the locations in data. That's why you got a long vector in response.
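As a small check of that explanation, the sketch below regenerates data of the same shape as in the question (the seed is an arbitrary choice added here for reproducibility) and confirms that dnorm() returns one density value per element of data.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
data <- rnorm(724, mean = 84, sd = 33)

# One density value for each location in 'data'
dens <- dnorm(data, mean = mean(data), sd = sd(data))

length(dens)   # 724, the same length as 'data'
all(dens > 0)  # normal densities are strictly positive for finite inputs
```

These 724 numbers are the heights of the fitted normal curve at the observed data points, which is different from what curve() plots: it evaluates the expression on its own grid of x values.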
For your second question: magrittr passes the result of things on the left of the pipe into the function call on the right. There are some rules for where that result appears:
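If you don't use . in the function call, the value is used as the first argument.
If you do use . as an argument of its own, the value appears there instead of in the first place. If . only appears nested inside another call, such as mean(.), the value is still inserted as the first argument as well. See ?`%>%` for details.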
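The placement rules can be seen in a short sketch, assuming the magrittr package is installed. The vectors and variable names here are made up for illustration and have nothing to do with the histogram code.

```r
library(magrittr)

# No dot: the left-hand side becomes the first argument, i.e. sqrt(c(1, 4, 9))
no_dot <- c(1, 4, 9) %>% sqrt()

# Dot as an argument of its own: the value goes only there, i.e. seq(1, 10, by = 2)
direct_dot <- 2 %>% seq(1, 10, by = .)

# Dot only inside a nested call: the value is ALSO inserted as the first
# argument, i.e. rep(c(1, 2, 3), times = length(c(1, 2, 3)))
nested_dot <- c(1, 2, 3) %>% rep(times = length(.))

# Curly braces switch off the first-argument insertion: only the dots are filled in
braced <- c(1, 2, 3) %>% {length(.) * 2}
```

The nested case is the one that matters for curve(): in dnorm(x, mean = mean(.), sd = sd(.)) every dot sits inside another call, so without braces the piped value would also be pushed in as a first argument.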
So to get what you want, you could do this:
data %>% {curve(dnorm(x, mean = mean(.), sd = sd(.)), lwd = 5, add = TRUE, yaxt = "n")}
I had to use the curly brackets to get magrittr to handle the . properly: without them, data would also be inserted as the first argument of curve().