Search code examples
rdplyrdata-manipulationstandardized

Generate internal age and sex z-scores


I have the following data frame, with data from 1000 people on sex, three repeated height measures and the age at each measure.

data <- data.frame(
child_id = 1:1000,
sex = rbinom(n = 1000, size = 1, prob = 0.5),
height_5 = rnorm(1000, mean = 80, sd = 5),
height_6 = rnorm(1000, mean = 90, sd = 5),
height_7 = rnorm(1000, mean = 100, sd = 5),
age_5 = rnorm(1000, mean = 5.2, sd = 1.5),
age_6 = rnorm(1000, mean = 6.1, sd = 1.5),
age_7 = rnorm(1000, mean = 7.3, sd = 1.5)
)

data$sex <- factor(data$sex,
                 levels = c(0,1),
                 labels = c("Male", "Female"))

### Generate SOME MISSING VALUES -----
data$height_5[which(data$height_5 %in% sample(data$height_5, 25))] <- NA
data$height_6[which(data$height_6 %in% sample(data$height_6, 25))] <- NA
data$height_7[which(data$height_7 %in% sample(data$height_7, 25))] <- NA

I can generate zscores at each measurement as follows

data$ht5z <- scale(data$height_5, center = TRUE, scale = TRUE)
data$ht6z <- scale(data$height_6, center = TRUE, scale = TRUE)
data$ht7z <- scale(data$height_7, center = TRUE, scale = TRUE)

How can i generate these for each sex and year e.g. htzm3 if sex = male and age >=3 and <4, htzm4 if sex = male and age >=4 and <5 etc.


Solution

  • How about this:

    library(dplyr)
    library(stringr)
    library(tidyr)
    
    data %>% 
      gather(key, value, age_5, age_6, age_7, height_5, height_6, height_7) %>% 
      separate(key, c("key", "obs_time"), "_") %>% 
      spread(key, value) %>% 
      mutate(whole_age = floor(age)) %>% 
      group_by(sex, whole_age) %>% 
      mutate(htz = scale(height), 
             sex_init = str_to_lower(str_extract(sex, "^.")), 
             sa = paste0("htz", sex_init, whole_age)) %>% 
      ungroup() %>%
      spread(sa, htz)
    

    First we would like to put the data in a tidy format.

    To do so we first gather up all of your age and height columns into just two columns: key and value. key then takes on the name of the original variable as values, value takes on the value under the corresponding variable, and other variables are copied down as is. The data now look like this:

    # A tibble: 6,000 x 4
       child_id sex   key      value
          <int> <fct> <chr>    <dbl>
     1        1 Male  age_5     5.67
     2        1 Male  age_6     7.02
     3        1 Male  age_7     8.86
     4        1 Male  height_5 79.2 
     5        1 Male  height_6 95.8 
     6        1 Male  height_7 85.0 
     7        2 Male  age_5     3.38
     8        2 Male  age_6     5.06
     9        2 Male  age_7     5.47
    10        2 Male  height_5 79.2 
    # ... with 5,990 more rows
    

    Second, we separate the key column into two columns: key and obs_time using the "_" as the delimiter. The data now look like:

    # A tibble: 6,000 x 5
       child_id sex   key    obs_time value
          <int> <fct> <chr>  <chr>    <dbl>
     1        1 Male  age    5         5.67
     2        1 Male  age    6         7.02
     3        1 Male  age    7         8.86
     4        1 Male  height 5        79.2 
     5        1 Male  height 6        95.8 
     6        1 Male  height 7        85.0 
     7        2 Male  age    5         3.38
     8        2 Male  age    6         5.06
     9        2 Male  age    7         5.47
    10        2 Male  height 5        79.2 
    # ... with 5,990 more rows
    

    Third, we spread the values up into two variables: age and height. The data now look like:

    # A tibble: 3,000 x 5
       child_id sex   obs_time   age height
          <int> <fct> <chr>    <dbl>  <dbl>
     1        1 Male  5         5.67   79.2
     2        1 Male  6         7.02   95.8
     3        1 Male  7         8.86   85.0
     4        2 Male  5         3.38   79.2
     5        2 Male  6         5.06   81.8
     6        2 Male  7         5.47  102. 
     7        3 Male  5         5.04   80.4
     8        3 Male  6         6.37   95.3
     9        3 Male  7         7.01   97.4
    10        4 Male  5         6.25   90.8
    # ... with 2,990 more rows
    

    Fourth, through seventh, we mutate the age category whole_age and then group by sex and whole_age so that when we scale it will be applied separately for each of those groups. Then we do the scaling in each group, extract the first initial of sex and construct variable names corresponding to the freshly scaled values all in one column called sa. We can then remove the grouping. The data now look like:

    # A tibble: 3,000 x 9
       child_id sex   obs_time   age height whole_age     htz sex_init sa   
          <int> <fct> <chr>    <dbl>  <dbl>     <dbl>   <dbl> <chr>    <chr>
     1        1 Male  5         5.67   79.2         5 -0.967  m        htzm5
     2        1 Male  6         7.02   95.8         7  0.345  m        htzm7
     3        1 Male  7         8.86   85.0         8 -1.20   m        htzm8
     4        2 Male  5         3.38   79.2         3 -0.580  m        htzm3
     5        2 Male  6         5.06   81.8         5 -0.681  m        htzm5
     6        2 Male  7         5.47  102.          5  1.55   m        htzm5
     7        3 Male  5         5.04   80.4         5 -0.829  m        htzm5
     8        3 Male  6         6.37   95.3         6  0.455  m        htzm6
     9        3 Male  7         7.01   97.4         7  0.529  m        htzm7
    10        4 Male  5         6.25   90.8         6 -0.0366 m        htzm6
    # ... with 2,990 more rows
    

    Finally, we can spread the data into the variables you requested. And now we have:

    # A tibble: 3,000 x 32
       child_id sex   obs_time   age height whole_age sex_init htzf0 htzf1 htzf10 htzf11 htzf2 htzf3
          <int> <fct> <chr>    <dbl>  <dbl>     <dbl> <chr>    <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>
     1        1 Male  5         5.67   79.2         5 m           NA    NA     NA     NA    NA    NA
     2        1 Male  6         7.02   95.8         7 m           NA    NA     NA     NA    NA    NA
     3        1 Male  7         8.86   85.0         8 m           NA    NA     NA     NA    NA    NA
     4        2 Male  5         3.38   79.2         3 m           NA    NA     NA     NA    NA    NA
     5        2 Male  6         5.06   81.8         5 m           NA    NA     NA     NA    NA    NA
     6        2 Male  7         5.47  102.          5 m           NA    NA     NA     NA    NA    NA
     7        3 Male  5         5.04   80.4         5 m           NA    NA     NA     NA    NA    NA
     8        3 Male  6         6.37   95.3         6 m           NA    NA     NA     NA    NA    NA
     9        3 Male  7         7.01   97.4         7 m           NA    NA     NA     NA    NA    NA
    10        4 Male  5         6.25   90.8         6 m           NA    NA     NA     NA    NA    NA
    # ... with 2,990 more rows, and 19 more variables: htzf4 <dbl>, htzf5 <dbl>, htzf6 <dbl>,
    #   htzf7 <dbl>, htzf8 <dbl>, htzf9 <dbl>, htzm0 <dbl>, htzm1 <dbl>, htzm10 <dbl>, htzm11 <dbl>,
    #   htzm12 <dbl>, htzm2 <dbl>, htzm3 <dbl>, htzm4 <dbl>, htzm5 <dbl>, htzm6 <dbl>, htzm7 <dbl>,
    #   htzm8 <dbl>, htzm9 <dbl>