Search code examples
rvariablesconditional-statementsintervals

Create categorical variable based on numeric intervals


I have this dataframe

data<-data.frame(class1=c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                 class2=c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8),
                        observations=c(444,475, 531,560,650,668,705,717,456,876,123,47,249,180,500,654))

and need to create a new categorical variable "class3" based on 2 unit intervals of "class2". If class2 is between 1 and 2, then "class3" is 1, and so on. "class2" is sequential.

I can create a new table with the defined intervals and then join.

intv<-data.frame(class2=c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8),
                 class3=c(1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4))

data.2<-left_join(data,intv,by = join_by(class2)) 



> data.2
   class1 class2 observations class3
1       A      1          444      1
2       A      1          444      1
3       A      2          475      1
4       A      2          475      1
5       A      3          531      2
6       A      3          531      2
7       A      4          560      2
8       A      4          560      2
9       A      5          650      3
10      A      5          650      3
11      A      6          668      3
12      A      6          668      3
13      A      7          705      4
14      A      7          705      4
15      A      8          717      4
16      A      8          717      4
17      B      1          456      1
18      B      1          456      1
19      B      2          876      1
20      B      2          876      1
21      B      3          123      2
22      B      3          123      2
23      B      4           47      2
24      B      4           47      2
25      B      5          249      3
26      B      5          249      3
27      B      6          180      3
28      B      6          180      3
29      B      7          500      4
30      B      7          500      4
31      B      8          654      4
32      B      8          654      4

But the real dataframe has lots of observations, so it would take a lot of time.

Is there a function to do so automatically just indicating the interval size?


Solution

  • For included example data, dividing by 2 and rounding up should be enough:

    data<-data.frame(class1=c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"),
                     class2=c(1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8),
                     observations=c(444,475, 531,560,650,668,705,717,456,876,123,47,249,180,500,654))
    
    data$class3 <- ceiling(data$class2 / 2)
    # if you need it to be categorical / factor :
    data$class3_fct <- as.factor(data$class3)
    
    head(data, n = 10)
    #>    class1 class2 observations class3 class3_fct
    #> 1       A      1          444      1          1
    #> 2       A      2          475      1          1
    #> 3       A      3          531      2          2
    #> 4       A      4          560      2          2
    #> 5       A      5          650      3          3
    #> 6       A      6          668      3          3
    #> 7       A      7          705      4          4
    #> 8       A      8          717      4          4
    #> 9       B      1          456      1          1
    #> 10      B      2          876      1          1
    str(data)
    #> 'data.frame':    16 obs. of  5 variables:
    #>  $ class1      : chr  "A" "A" "A" "A" ...
    #>  $ class2      : num  1 2 3 4 5 6 7 8 1 2 ...
    #>  $ observations: num  444 475 531 560 650 668 705 717 456 876 ...
    #>  $ class3      : num  1 1 2 2 3 3 4 4 1 1 ...
    #>  $ class3_fct  : Factor w/ 4 levels "1","2","3","4": 1 1 2 2 3 3 4 4 1 1 ...
    

    Created on 2024-01-19 with reprex v2.0.2