
Automatically assign weights to variables based on factor level


I am having some trouble wording my issue, so I am using the mtcars dataset as an example.

Imagine I am a student of social sciences in the Pixar Cars(TM) universe. For a small school project on statistical methods, I am running a survey among my peers. My target is to collect data on a sample of 30 cars, half automatic and half manual. After my online survey has closed and I have cleaned up my data, it looks like the mtcars dataset.

data(mtcars)
str(mtcars)
mtcars$am <- as.factor(mtcars$am)
levels(mtcars$am) <- c("automatic", "manual") # because anthropomorphic cars prefer factors with levels over binary code 

If I use table(mtcars$am), I find that the dataset contains 19 automatic and 13 manual transmission cars. Looks like I didn't hit my target of an equal number of manual and automatic cars :(! Luckily, as a car-sociologist, I can fix this by weighting my dataset. I divide the target count by the collected count to get the weight of each observation. Thus, all automatic cars should get a weight of 0.7895 (15/19) and manual cars a weight of 1.1538 (15/13). Assigning the correct weight to each observation is fairly straightforward:

mtcars$weight <- ifelse(mtcars$am == "automatic", 0.7894737, 1.153846)

You can imagine that this method becomes a bit cumbersome with larger datasets with more weight-categories. Is there a way to automate the process of assigning the weights to each observation?

As a car and self-taught R user who mainly cobbles things together as needed, I don't really know where to start. I've been using the method above, but with a growing number of target groups it's not really sustainable anymore.

I did of course try to find the answer elsewhere on the WWW, but unfortunately without much success. The following question seemed promising, but it doesn't provide a solution for me:

R: new variable values based on factor levels of another variable


Solution

  • Generally, you have sample proportions that exceed or fall short of the expected population proportions, so you want to weight the sample to bring it in line with the population. You get the weights by dividing the population proportions by the sample proportions.

    Let's demonstrate this by the number of carburetors provided in mtcars. Say the known/expected proportion is:

    carb_pop <- c(.25, .28, .1, .28, .05, .04) |> setNames(c(1:4, 6, 8))
    carb_pop
    #    1    2    3    4    6    8 
    # 0.25 0.28 0.10 0.28 0.05 0.04 
    

    However, in the sample we have:

    carb_smp <- table(mtcars$carb)
    proportions(carb_smp)
    #       1       2       3       4       6       8 
    # 0.21875 0.31250 0.09375 0.31250 0.03125 0.03125 
    

    Now we can create a named vector w with weights:

    w <- carb_pop/proportions(carb_smp)
    w
    #        1        2        3        4        6        8 
    # 1.142857 0.896000 1.066667 0.896000 1.600000 1.280000 
    

    These weights bring the sample proportions in line with the population proportions:

    all(carb_pop == w*proportions(carb_smp))
    # [1] TRUE
    
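The check above relies on exact `==` comparison of doubles, which can come out FALSE for values that differ only by rounding error. A tolerance-based variant with `all.equal` is a safer sketch of the same check. The objects are rebuilt here so the snippet runs on its own, and `as.vector` is used on the assumption that only the numbers, not the names or table attributes, should be compared:

```r
data(mtcars)

# Same population and sample proportions as above
carb_pop <- c("1" = .25, "2" = .28, "3" = .10, "4" = .28, "6" = .05, "8" = .04)
carb_smp <- proportions(table(mtcars$carb))
w <- carb_pop / carb_smp

# as.vector() drops names and table attributes, so only the values are compared;
# all.equal() compares within a small numeric tolerance instead of bit-for-bit
check <- isTRUE(all.equal(as.vector(carb_pop), as.vector(w * carb_smp)))
check

# Why exact comparison is fragile for doubles in general
exact    <- 0.1 + 0.2 == 0.3
tolerant <- isTRUE(all.equal(0.1 + 0.2, 0.3))
```

Here `check` and `tolerant` are TRUE while `exact` is FALSE, which is why the tolerant form is the more robust choice for this kind of sanity check.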

    We can now use the named vector to assign a weight to each row with match, similar to the approach in your linked question:

    mtcars$weights <- w[match(mtcars$carb, names(w))]
    

    This gives:

    head(mtcars)
    #                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  weights
    # Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 0.896000
    # Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 0.896000
    # Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 1.142857
    # Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 1.142857
    # Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 0.896000
    # Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 1.142857
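
The same idea can be wrapped in a small function and applied to the transmission example from the question. This is only a sketch: `weights_from_targets` is a hypothetical helper name, not a base R function, and it assumes the targets are given as a named vector of shares whose names match the factor levels.

```r
# Hypothetical helper (not part of base R): returns one weight per row,
# computed as target share / sample share for that row's group.
weights_from_targets <- function(x, target) {
  x <- as.character(x)
  smp <- proportions(table(x))          # observed share of each group
  missing_lvls <- setdiff(names(smp), names(target))
  if (length(missing_lvls) > 0) {
    stop("no target share for: ", paste(missing_lvls, collapse = ", "))
  }
  # Align targets to the observed groups by name, then divide
  w <- target[names(smp)] / as.numeric(smp)
  unname(w[x])                          # look up each row's weight by group name
}

data(mtcars)
mtcars$am <- factor(mtcars$am, levels = c(0, 1),
                    labels = c("automatic", "manual"))

# Target: half automatic, half manual
mtcars$weight <- weights_from_targets(mtcars$am,
                                      c(automatic = 0.5, manual = 0.5))

# Weighted group sizes; both groups should now carry the same total weight
xtabs(weight ~ am, data = mtcars)
```

Because the helper works with proportions, the weights sum to the sample size of 32, giving 16/19 for automatic and 16/13 for manual cars. These differ from the 15/19 and 15/13 in the question only by the constant factor 16/15, since the question scales to a total of 30 instead. Adding more target groups only means adding more entries to the named target vector.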