r, dataframe, associations, categorical-data, chi-squared

R: simulating a population where two categorical variables are independent


Background

For teaching purposes, I use simulations (mainly in R) to help students (in the social sciences, with no math or stats background) grasp some of the "tough" concepts behind statistical topics. I am planning to lecture on the chi-squared test of independence, and I have prepared a small 2x2 contingency table that cross-tabulates GENDER (two levels: M and F) against POLITICAL AFFILIATION (two levels: PartyA and PartyB). In this toy dataset, the two variables are significantly dependent.

Goal I have in mind

To help students understand the sampling distribution of the chi-squared statistic under the null hypothesis, I would like to simulate a population in which the two variables above are independent. That would let me (1) draw a random sample, cross-tabulate the two variables, and show that the chi-squared test turns out not to be significant, and (2) draw B random samples, calculate the chi-squared statistic B times, and plot a histogram of those B chi-squared values (which should approximate the sampling distribution of the chi-squared statistic under the null hypothesis).

Where I need help

I cannot figure out how to simulate a population in which those two categorical variables are independent. Ideally, I would like to end up with a dataframe with many rows and two columns: each row would represent an observation (an individual, in our case), and each column would store that observation's level of one of the categorical variables being analyzed (i.e., GENDER and POLITICAL AFFILIATION).


Solution

  • You can use sample() with the replace argument set to TRUE, then combine the resulting vectors as columns with data.frame():

    dat <- data.frame(gender = sample(c("F", "M"), size = 1000, replace = TRUE),
                      party = sample(c("Party A", "Party B"), size = 1000, replace = TRUE))
    

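    Because you're generating these two variables separately, they are independent in the population. Any association that shows up in a particular sample is due to chance alone, so at a 0.05 significance level you should still expect roughly 5% of samples to give a significant result.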
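
To connect this to the two goals in the question, here is a minimal sketch of how the simulated population could be used. The population size, sample size, number of replicates B, and the seed are arbitrary choices for illustration, and chisq_stat is a hypothetical helper name, not something from the question. The continuity correction is switched off (correct = FALSE) on the assumption that you want the plain Pearson statistic for comparison with the theoretical chi-squared curve on 1 degree of freedom.

```r
set.seed(42)  # arbitrary seed, only to make the run reproducible

# Population in which gender and party are generated independently
N <- 100000   # assumed population size
pop <- data.frame(
  gender = factor(sample(c("F", "M"), size = N, replace = TRUE)),
  party  = factor(sample(c("Party A", "Party B"), size = N, replace = TRUE))
)

# Hypothetical helper: draw n individuals and return the Pearson chi-squared statistic
chisq_stat <- function(n) {
  s   <- pop[sample(nrow(pop), n), ]   # simple random sample without replacement
  tab <- table(s$gender, s$party)      # 2x2 contingency table
  unname(chisq.test(tab, correct = FALSE)$statistic)
}

# (1) One sample: cross-tabulate and run the test
s1   <- pop[sample(nrow(pop), 200), ]
tab1 <- table(s1$gender, s1$party)
print(tab1)
print(chisq.test(tab1, correct = FALSE))

# (2) B samples: empirical sampling distribution of the statistic under the null
B <- 2000
chi2_vals <- replicate(B, chisq_stat(200))

hist(chi2_vals, breaks = 40, freq = FALSE,
     main = "Chi-squared statistic under independence",
     xlab = "Chi-squared value")
# Overlay the theoretical chi-squared density with 1 degree of freedom
curve(dchisq(x, df = 1), from = 0.01, to = max(chi2_vals), add = TRUE)

# Share of samples exceeding the 5% critical value
reject_rate <- mean(chi2_vals > qchisq(0.95, df = 1))
print(reject_rate)
```

The histogram should roughly follow the overlaid curve, and reject_rate should land near 0.05, which is a handy way to show students that "not significant" is the usual outcome under the null hypothesis but not a guaranteed one.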