Search code examples
rldamass

How do I find the scores at which the LDA function from MASS specifies to which class an observation belongs?


I have a dataset of body measurements for birds and I'm using the lda function from the MASS package to find out the extent of sexual dimorphism. Eventually, I want to end up with an equation and critical score that can be used in the field (without access to computers or R) to determine if the bird in hand is male or female. In our data set, there are more males than females. I don't know exactly why that is, but for now, I'm assuming this means there is a real reason why males are captured more often than females, though our dataset is only 34 birds so this might not be significant.

I know how to extract/determine the equation (following the instructions halfway down the page here: https://stats.stackexchange.com/questions/157772/how-to-find-the-line) but there is some overlap in the D-scores where the predict.lda function seems to go either way. I expected the critical D-score to be 0 but it's not...

I would like to know how I can find 1) the D-score where the model will always determine the bird is female (or male), 2) what the extent of the overlap is.

Mock code (with the real data there is more overlap):

set.seed(42) 

train <- data.frame(sex = c(rep("F", 35), rep("M", 65)),
                   A = c(rnorm(35, 20, 2.5), rnorm(65, 15, 2.5)),
                   B = c(rnorm(35, 6, 0.2), rnorm(65, 5.5, 0.2)),
                   C = c(rnorm(35, 250, 5), rnorm(65, 240, 5)),
                   D = c(rnorm(35, 450, 25), rnorm(65, 350, 25)))

mod <- lda(sex ~ ., data = train)
mod

gm = mod$prior %*% mod$means # these are used to get the equation
const = drop(gm %*% mod$scaling)

#the equation is then: D = mod$scaling[1] * A + mod$scaling[2] * B + mod$scaling[3] * C + mod$scaling[4] * D - const

test <- data.frame(sex = c(rep("F", 350), rep("M", 650)),
                  A = rnorm(1000, gm[1], 2.5),
                  B = rnorm(1000, gm[2], 0.2),
                  C = rnorm(1000, gm[3], 5),
                  D = rnorm(1000, gm[4], 25))

pred <- data.frame(predict(mod, test)$x, class = predict(mod, test)$class)

enter image description here

I've Googled a lot and looked at several stack exchange and stack overflow questions, but I can't figure it out.


Solution

  • For your example data the quantiles for male and female:

    by(D, train$sex, quantile)
    # train$sex: F
    #        0%       25%       50%       75%      100% 
    # -6.271599 -4.489364 -3.770150 -3.017528 -1.327032 
    # ----------------------------------------------------------------------------
    # train$sex: M
    #         0%        25%        50%        75%       100% 
    # -0.8563099  1.5266578  1.9219727  2.7991112  3.8717447 
    

    There is no overlap for this example. D values less than -1.327 are always female and values greater than -.856 are always male. If the ranges overlap, then you will have to decide whether to flip a coin or record them as uncertain.

    You can get a more detailed view by looking at the posterior probabilities:

    pred.tr <- as.data.frame(predict(mod))
    idx <- order(pred.tr$LD1)
    pred.srt <- pred.tr[idx, ]
    pred.srt
    #     class  posterior.F  posterior.M        LD1
    # 4       F 1.000000e+00 3.895671e-14 -6.2715995
    # 25      F 1.000000e+00 7.087004e-14 -6.1690763
    # 35      F 1.000000e+00 5.234647e-12 -5.4319799
    # 2       F 1.000000e+00 9.615516e-11 -4.9332964
    # 18      F 1.000000e+00 1.017526e-10 -4.9236025
    #  . . . .
    # 13      F 9.996574e-01 3.426315e-04 -2.3485213
    # 28      F 9.996073e-01 3.926946e-04 -2.3251473
    # 19      F 8.825072e-01 1.174928e-01 -1.3270319 # <- Last female
    # 81      M 3.249597e-01 6.750403e-01 -0.8563099 # <- First male
    # 80      M 2.324926e-04 9.997675e-01  0.4518529
    # 46      M 2.247020e-04 9.997753e-01  0.4576938
    # . . . .
    # 36      M 1.282832e-11 1.000000e+00  3.3152791
    # 39      M 2.153913e-12 1.000000e+00  3.6209947
    # 52      M 1.169887e-12 1.000000e+00  3.7255708
    # 82      M 8.625676e-13 1.000000e+00  3.7777833
    # 59      M 4.984432e-13 1.000000e+00  3.8717447
    

    You could also use the test data instead of the training data, to see if the boundary between male and female is fuzzier than the training data suggest. The posterior probabilities indicate that for LD1 values less than -1.327 the probability of being female is essentially 100%. For values of -.856 the probability of being male is 67.5% and by .452 and above it is essentially 100%.