I am having some trouble wording my issue, so I am using the mtcars dataset as an example.
Imagine I am a student of social sciences in the Pixar Cars(TM) universe. For a small school project on statistical methods, I am running a survey among my peers. My target is to collect data on a sample of 30 cars, half of them automatic and the other half manual. After my online survey has closed and I have cleaned up my data, it looks like the mtcars dataset.
data(mtcars)
str(mtcars)
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual") # because anthropomorphic cars prefer factors with levels over binary code
If I use table(mtcars$am), I find that there are 19 automatic and 13 manual transmission cars in the dataset. Looks like I didn't hit the target of an equal number of manual and automatic cars :(! Luckily, as a car-sociologist, I can fix this by weighting my dataset. I divide the target # by the collected # to get the weight of each observation. Thus, all automatic cars should get a weight of 0.7894 (15/19) and manual cars a weight of 1.1538 (15/13). Assigning the correct weight to each observation is fairly straightforward:
mtcars$weight <- ifelse(mtcars$am == "automatic", 0.7894737, 1.153846)
You can imagine that this method becomes a bit cumbersome with larger datasets with more weight-categories. Is there a way to automate the process of assigning the weights to each observation?
As a car and self-taught R user who mainly cobbles things together as needed, I don't really know where to start. I've been using the method above, but with a growing number of target groups it's no longer sustainable.
I did of course try to find the answer elsewhere on the web, but without much success. The following question seemed promising, but doesn't solve my problem:
R: new variable values based on factor levels of another variable
Generally, your sample proportions exceed or fall short of the expected population proportions, so you want to weight the sample to bring it in line with the population. You get the weights by dividing the population proportions by the sample proportions.
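To make the arithmetic concrete, here is a minimal sketch applied to the transmission example from the question. It assumes a 50/50 target and uses illustrative names (am_pop, am_smp, w_am) that do not appear in the original post. Weights built from proportions sum to the sample size (32), so rescaling them to the question's target of 30 cars should give back the hand-computed 15/19 and 15/13.

```r
# Assumed target from the question: half automatic, half manual
am_pop <- c(automatic = 0.5, manual = 0.5)

# Sample proportions in a fresh mtcars (am: 0 = automatic, 1 = manual)
am_tab <- table(factor(datasets::mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual")))
am_smp <- setNames(as.vector(proportions(am_tab)), names(am_tab))

# Weight = population proportion / sample proportion, aligned by name
w_am <- am_pop[names(am_smp)] / am_smp
w_am

# Rescale so the weights sum to 30 instead of 32;
# this matches the question's target-count / collected-count weights
w_am * 30 / 32
```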
Let's demonstrate this with the number of carburetors (carb) in mtcars. Say the known/expected proportions are:
carb_pop <- c(.25, .28, .1, .28, .05, .04) |> setNames(c(1:4, 6, 8))
carb_pop
# 1 2 3 4 6 8
# 0.25 0.28 0.10 0.28 0.05 0.04
However, in the sample we have:
carb_smp <- table(mtcars$carb)
proportions(carb_smp)
# 1 2 3 4 6 8
# 0.21875 0.31250 0.09375 0.31250 0.03125 0.03125
Now we can create a named vector w with weights:
w <- carb_pop/proportions(carb_smp)
w
# 1 2 3 4 6 8
# 1.142857 0.896000 1.066667 0.896000 1.600000 1.280000
which brings the proportions in line:
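# compare with a tolerance; exact == on doubles can fail after division
isTRUE(all.equal(as.numeric(carb_pop), as.numeric(w * proportions(carb_smp))))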
# [1] TRUE
We can now use the named vector to assign weights with a match approach, similar to the one in your linked question.
mtcars$weights <- w[match(mtcars$carb, names(w))]
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  weights
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.896000
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.896000
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1.142857
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1.142857
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.896000
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1.142857
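If you need this for many grouping variables, the same lookup can be wrapped in a small function. The following is only a sketch: post_weights is a hypothetical helper name, not part of base R or any package, and it assumes the target is a named vector (counts or proportions) with one entry per observed category.

```r
# Hypothetical helper: one weight per observation from a named target vector
post_weights <- function(x, target) {
  x <- as.character(x)
  smp <- proportions(table(x))           # sample proportion per category

  # Fail early if an observed category has no target
  missing <- setdiff(names(smp), names(target))
  if (length(missing) > 0) {
    stop("no target for: ", paste(missing, collapse = ", "))
  }

  target <- target / sum(target)         # accept counts or proportions
  w <- target[names(smp)] / as.vector(smp)
  unname(w[x])                           # look up each row's weight by name
}

# Usage on the transmission example from the question (fresh copy of mtcars)
cars <- datasets::mtcars
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))
cars$weight <- post_weights(cars$am, c(automatic = 15, manual = 15))

# Weighted group sizes: both should be close to 16 (half of 32 rows)
tapply(cars$weight, cars$am, sum)
```

Because the target is normalised inside the function, these weights sum to the number of rows rather than to the target count. Multiply them by 30 / 32 if you want them on the scale used in the question.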