So I have a dataset of life expectancy values between 30 and 100. I want to normalise them to the range 0-1, but unevenly, so that equal steps in age do not map to equal steps in the normalised value.
Basically I have four defined interval/breakpoint values creating 5 classes:
| Breakpoint | Age | Normalised value |
|---|---|---|
| Min | 30 | 0 |
| BP1 | 57 | 0.2 |
| BP2 | 62 | 0.4 |
| BP3 | 66 | 0.6 |
| BP4 | 71 | 0.8 |
| Max | 100 | 1 |
I can easily reclassify them into those five classes, but I don't know how to calculate the normalised values using those breakpoints. All five classes span 0.2 of the normalised range, but the age range in each class is different, e.g. class one, 0-0.2, has an age range of 27 years, but class two, 0.2-0.4, has a range of just 5 years.
Example data:
ages <- floor(runif(50, min = 30, max = 100))
Edit:
So on this graph, x is life expectancy in years, and y is the normalised values - I want to calculate the exact value of y for each x value.
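One standard approach is to fit a logistic regression to the breakpoints. This gives a smooth S-shaped curve that comes close to the breakpoint values but does not pass through them exactly: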
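# Assumed: df holds the breakpoint table from the question
df <- data.frame(Age = c(30, 57, 62, 66, 71, 100), Normalised_value = c(0, 0.2, 0.4, 0.6, 0.8, 1))
# glm() warns about non-integer successes here because the response is a proportion
glmfit <- glm(Normalised_value ~ Age, data = df, family = binomial())
plot(glmfit)  # model diagnostic plots, not the fitted curve
predict(glmfit, newdata = data.frame(Age = 30:100), type = "response")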
1 2 3 4 5 6 7
0.001038676 0.001270899 0.001554960 0.001902391 0.002327270 0.002846769 0.003481828
8 9 10 11 12 13 14
0.004257952 0.005206175 0.006364213 0.007777825 0.009502425 0.011604952 0.014166036
15 16 17 18 19 20 21
0.017282440 0.021069773 0.025665399 0.031231419 0.037957508 0.046063272 0.055799610
22 23 24 25 26 27 28
0.067448397 0.081319557 0.097744402 0.117063988 0.139611296 0.165686430 0.195524878
29 30 31 32 33 34 35
0.229260322 0.266885403 0.308216030 0.352866476 0.400242921 0.449561353 0.499891678
36 37 38 39 40 41 42
0.550224198 0.599549040 0.646935614 0.691599168 0.732945010 0.770586518 0.804338777
43 44 45 46 47 48 49
0.834193745 0.860284578 0.882846413 0.902179147 0.918615680 0.932497075 0.944154716
50 51 52 53 54 55 56
0.953898634 0.962010835 0.968742352 0.974312922 0.978912346 0.982702836 0.985821857
57 58 59 60 61 62 63
0.988385104 0.990489415 0.992215484 0.993630305 0.994789335 0.995738372 0.996515163
64 65 66 67 68 69 70
0.997150770 0.997670717 0.998095963 0.998443694 0.998728001 0.998960424 0.999150415
71
0.999305707
plot(df$Age,df$Normalised_value)
lines(30:100,predict(glmfit,newdata=data.frame(Age=30:100),type="response"))
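To see how closely the logistic curve tracks the breakpoints, the fitted values can be compared with the targets directly. The following is a minimal sketch, not part of the original answer: the names bp, fit and check are made up for illustration, and quasibinomial() is swapped in for binomial() only because it gives the same coefficient estimates without the non-integer warning.

```r
# Same six breakpoints as in the question's table (bp is a hypothetical name)
bp <- data.frame(Age = c(30, 57, 62, 66, 71, 100),
                 Normalised_value = c(0, 0.2, 0.4, 0.6, 0.8, 1))

# quasibinomial() yields the same point estimates as binomial(),
# but does not warn about a fractional response
fit <- glm(Normalised_value ~ Age, data = bp, family = quasibinomial())

# Fitted value at each breakpoint next to its target
check <- data.frame(bp, fitted = predict(fit, newdata = bp, type = "response"))
check$difference <- check$fitted - check$Normalised_value
check
```

The differences are small but not zero, so this route suits cases where a smooth curve matters more than hitting each breakpoint exactly.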
Original:
We can use splines::bs together with lm to fit a piecewise linear spline regression:
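library(splines)
# df is the breakpoint table (columns Age and Normalised_value)
# degree = 1 makes the spline piecewise linear, with a knot at each interior breakpoint
spline_fit <- lm(Normalised_value ~ bs(Age, knots = c(57, 62, 66, 71), degree = 1), data = df)
newdata <- data.frame(Age = 30:100)
newdata$normalized <- predict(spline_fit, newdata = newdata)
newdata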
Age normalized
1 30 1.416396e-16
2 31 7.407407e-03
3 32 1.481481e-02
4 33 2.222222e-02
5 34 2.962963e-02
6 35 3.703704e-02
7 36 4.444444e-02
8 37 5.185185e-02
9 38 5.925926e-02
10 39 6.666667e-02
11 40 7.407407e-02
12 41 8.148148e-02
13 42 8.888889e-02
14 43 9.629630e-02
15 44 1.037037e-01
16 45 1.111111e-01
17 46 1.185185e-01
18 47 1.259259e-01
19 48 1.333333e-01
20 49 1.407407e-01
21 50 1.481481e-01
22 51 1.555556e-01
23 52 1.629630e-01
24 53 1.703704e-01
25 54 1.777778e-01
26 55 1.851852e-01
27 56 1.925926e-01
28 57 2.000000e-01
29 58 2.400000e-01
30 59 2.800000e-01
31 60 3.200000e-01
32 61 3.600000e-01
33 62 4.000000e-01
34 63 4.500000e-01
35 64 5.000000e-01
36 65 5.500000e-01
37 66 6.000000e-01
38 67 6.400000e-01
39 68 6.800000e-01
40 69 7.200000e-01
41 70 7.600000e-01
42 71 8.000000e-01
43 72 8.068966e-01
44 73 8.137931e-01
45 74 8.206897e-01
46 75 8.275862e-01
47 76 8.344828e-01
48 77 8.413793e-01
49 78 8.482759e-01
50 79 8.551724e-01
51 80 8.620690e-01
52 81 8.689655e-01
53 82 8.758621e-01
54 83 8.827586e-01
55 84 8.896552e-01
56 85 8.965517e-01
57 86 9.034483e-01
58 87 9.103448e-01
59 88 9.172414e-01
60 89 9.241379e-01
61 90 9.310345e-01
62 91 9.379310e-01
63 92 9.448276e-01
64 93 9.517241e-01
65 94 9.586207e-01
66 95 9.655172e-01
67 96 9.724138e-01
68 97 9.793103e-01
69 98 9.862069e-01
70 99 9.931034e-01
71 100 1.000000e+00
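The same piecewise linear mapping can also be computed without fitting a model, since it is just linear interpolation between neighbouring breakpoints. The following is a minimal sketch using base R's approx(); the helper name normalise_age and the seed are assumptions made for illustration, not part of the question or the original answer.

```r
# Breakpoints from the question: ages and their target normalised values
breaks  <- c(30, 57, 62, 66, 71, 100)
targets <- c(0, 0.2, 0.4, 0.6, 0.8, 1)

# Hypothetical helper: linear interpolation between neighbouring breakpoints.
# With approx()'s default rule = 1, ages outside 30-100 come back as NA.
normalise_age <- function(age) {
  approx(x = breaks, y = targets, xout = age)$y
}

set.seed(1)  # arbitrary seed, only so the example data is reproducible
ages <- floor(runif(50, min = 30, max = 100))
normalised <- normalise_age(ages)

head(data.frame(ages, normalised))
```

Within each class this amounts to lower value + 0.2 * (age - lower age) / (upper age - lower age). For age 60, that is 0.2 + 0.2 * (60 - 57) / (62 - 57) = 0.32, which matches the spline output above.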