Search code examples
rif-statementdplyrcase-when

How best to create calculated columns in R


Below is the sample data. The task at hand is creating two new columns that would designate something by zip code. The first new column would be titled Las_Vegas and the second would be Laughlin. The first eight zip codes would have a value of 1 for Las Vegas and the second eight would have a value of 1 for Laughlin. The purpose of this is that I want to sum the employment for Las Vegas and Laughlin.

First question: Would it be best to use ifelse or case_when? Second question: Making the two new columns into defacto dummy variables... is this the best approach?

  zipcode <-c(89102,89103,89104,89105,89106,89107,89108,89109,89110,89111,89112,89113,89114,89115,89116,89117)
  naicstest<-c(541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541213,541212,541215,541214)
  emptest <-c(2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32)


  county <- data.frame(zipcode,naicstest,emptest)

End result. This end result would have sixteen rows. I kept it short for sake of simplicity. one row for Las_Vegas and one row for Laughlin but there would be eight rows for Las_Vegas and eight for Laughlin. I know how to do the summarise (summing employment) but struggling how to make the two columns.

  zipcode     naicstest     emptest    Las_Vegas     Laughlin
    89102       541213         2           1             0
    89110       541213         18            0             1

Solution

  • We can use tidyverse

    1. We match the 'zipcode' by unique(zipcode) to get a numeric index for each unique zipcode.
    2. Use the index from 1 to create another index for every 8 elements with %/%
    3. The index from 2 is used as position index replacing with vector of values
    4. Use the output from 3 as a grouping variable
    5. Get the first row for each group - slice_head with n = 1
    6. Reshape from 'long' to 'wide' with pivot_wider
    library(dplyr)
    library(tidyr)
    county %>%
        group_by(un1 = c("Las_Vegas", "Laughlin")[
            (match(zipcode, unique(zipcode)) -1) %/% 8 + 1]) %>%
        slice_head(n = 1) %>%
        mutate(n = 1) %>% 
        pivot_wider(names_from = un1, values_from = n, values_fill = 0)
    

    -output

    # A tibble: 2 x 5
      zipcode naicstest emptest Las_Vegas Laughlin
        <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
    1   89102    541213       2         1        0
    2   89110    541213      18         0        1
    

    If we want to return all the rows, then don't do the slice_head, instead create a sequence column - row_number()

    county %>%
        group_by(un1 = c("Las_Vegas", "Laughlin")[
            (match(zipcode, unique(zipcode)) -1) %/% 8 + 1]) %>%    
        mutate(n = 1, rn = row_number()) %>%
        ungroup %>% 
        pivot_wider(names_from = un1, values_from = n, values_fill = 0) %>% 
        select(-rn)
    

    -ouptut

    # A tibble: 16 x 5
       zipcode naicstest emptest Las_Vegas Laughlin
         <dbl>     <dbl>   <dbl>     <dbl>    <dbl>
     1   89102    541213       2         1        0
     2   89103    541213       4         1        0
     3   89104    541213       6         1        0
     4   89105    541213       8         1        0
     5   89106    541213      10         1        0
     6   89107    541213      12         1        0
     7   89108    541213      14         1        0
     8   89109    541213      16         1        0
     9   89110    541213      18         0        1
    10   89111    541213      20         0        1
    11   89112    541213      22         0        1
    12   89113    541213      24         0        1
    13   89114    541213      26         0        1
    14   89115    541212      28         0        1
    15   89116    541215      30         0        1
    16   89117    541214      32         0        1