Background
For teaching purposes, I use simulations (mainly in R) to help students (in the social sciences, with no math or stats background) grasp some of the "tough" concepts and ideas behind certain stats topics. I am planning to lecture on the chi-squared test of independence, and I have prepared a small 2x2 contingency table that cross-tabulates GENDER (two levels: M and F) against POLITICAL AFFILIATION (two levels: PartyA, PartyB). In this toy dataset, there is a significant dependence.
Goal I have in mind
To help students understand the sampling distribution of the chi-squared statistic under the null hypothesis, I would like to simulate a population in which the two variables mentioned above are independent. I want to do this so that: (1) I can draw a random sample, cross-tabulate the two variables, and show that the chi-squared test turns out not to be significant, and (2) I can draw B random samples, calculate the chi-squared statistic B times, and plot a histogram of those B chi-squared values (this should represent the sampling distribution of the chi-squared statistic under the null hypothesis).
Where I need help
I cannot figure out a way of simulating a population where those 2 categorical variables are independent. Ideally, I would like to come up with a dataframe with a number of rows and two columns: each row would represent an observation (an individual in our case), while each column would store (for each observation) a level of each categorical variable being analyzed (i.e., the GENDER and the POLITICAL AFFILIATION).
You can use sample() with the replace argument set to TRUE, and combine the resulting vectors into columns with data.frame():
dat <- data.frame(gender = sample(c("F", "M"), size = 1000, replace = TRUE),
                  party = sample(c("Party A", "Party B"), size = 1000, replace = TRUE))
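Because you're generating these two variables separately, they are independent by construction. Any association that shows up in a particular sample is due to chance alone, so a single test will still come out significant about 5% of the time at the 0.05 level.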
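For the second part of the goal (B samples and a histogram of the statistic), here is a minimal sketch rather than a definitive implementation. It assumes each sample is generated directly with sample() instead of being drawn from a stored population data frame. The helper name one_chisq, the sample size n = 200, B = 2000, and the seed are all illustrative choices of mine. Yates' continuity correction is switched off (correct = FALSE) so that the simulated values can be compared with the chi-squared density with 1 degree of freedom.

```r
set.seed(42)  # arbitrary seed, only so the run is reproducible

# Hypothetical helper: simulate one sample of size n in which gender and
# party are generated separately, and return the chi-squared statistic.
one_chisq <- function(n = 200) {
  gender <- factor(sample(c("F", "M"), size = n, replace = TRUE),
                   levels = c("F", "M"))
  party  <- factor(sample(c("Party A", "Party B"), size = n, replace = TRUE),
                   levels = c("Party A", "Party B"))
  tab <- table(gender, party)  # 2x2 contingency table
  # correct = FALSE disables Yates' continuity correction
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

B <- 2000                                  # number of simulated samples
chisq_values <- replicate(B, one_chisq())  # B chi-squared statistics

# Histogram of the simulated statistics, on the density scale
hist(chisq_values, breaks = 40, freq = FALSE,
     main = "Chi-squared statistic under independence",
     xlab = "Chi-squared value")

# Overlay the theoretical chi-squared density with 1 df
# (starting slightly above 0, where the density is unbounded)
curve(dchisq(x, df = 1), from = 0.05, to = max(chisq_values),
      add = TRUE, lwd = 2)

# Share of samples that are "significant" at the 5% level
mean(chisq_values > qchisq(0.95, df = 1))
```

The last line gives the proportion of simulated samples that exceed the 5% critical value. It should land near 0.05, which is a handy way to show students what the significance level means.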