R: count distinct dates in a comma-separated character column (n_distinct(), nlevels(as.factor()), str_count() are not working)


> test
# A tibble: 30 × 2
# Groups:   Week [30]
    Week Dates                                                                                                                 
   <dbl> <chr>                                                                                                                 
 1     2 2023-10-04, 2023-10-05, 2023-10-05, 2023-10-06, 2023-10-06, 2023-10-06, 2023-10-08, 2023-10-08                        
 2     3 2023-10-11, 2023-10-12, 2023-10-12, 2023-10-14, 2023-10-15                                                            
 3     4 2023-10-18, 2023-10-19, 2023-10-20, 2023-10-20, 2023-10-21, 2023-10-21, 2023-10-22, 2023-10-22                        
 4     5 2023-10-25, 2023-10-25, 2023-10-26, 2023-10-27, 2023-10-28, 2023-10-29, 2023-10-29, 2023-10-30                        
 5     6 2023-11-01, 2023-11-01, 2023-11-01, 2023-11-01, 2023-11-02, 2023-11-02, 2023-11-03, 2023-11-04, 2023-11-05, 2023-11-05
 6     7 2023-11-09, 2023-11-10, 2023-11-13                                                                                    
 7     8 2023-11-16, 2023-11-17, 2023-11-18, 2023-11-19, 2023-11-21                                                            
 8     9 2023-11-22, 2023-11-22, 2023-11-23                                                                                    
 9    10 2023-11-29, 2023-11-30, 2023-12-02, 2023-12-03, 2023-12-04                                                            
10    11 2023-12-06, 2023-12-07, 2023-12-08, 2023-12-08, 2023-12-09, 2023-12-10, 2023-12-10                                    
# ℹ 20 more rows

The dates are pasted together with commas and stored as character strings in the data set 'test'. I need to count the unique dates for each week.
For example, the count for week 2 should be 4 (2023-10-04, 2023-10-05, 2023-10-06, 2023-10-08) and the count for week 3 should also be 4 (2023-10-11, 2023-10-12, 2023-10-14, 2023-10-15), and so on.

But when I tried the following:

> with(test, tapply(Dates, Week, function(x) nlevels(unique(as.factor(x)))))
 2  3  4  5  6  7  8  9 10 11 12 13 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 
 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 
> with(test, sapply(Dates, function(x) nlevels(unique(as.factor(x)))))
                                    2023-10-04, 2023-10-05, 2023-10-05, 2023-10-06, 2023-10-06, 2023-10-06, 2023-10-08, 2023-10-08 
                                                                                                                                 1 
                                                                        2023-10-11, 2023-10-12, 2023-10-12, 2023-10-14, 2023-10-15 
                                                                                                                                 1 
                                    2023-10-18, 2023-10-19, 2023-10-20, 2023-10-20, 2023-10-21, 2023-10-21, 2023-10-22, 2023-10-22 
                                                                                                                                 1 
                                    2023-10-25, 2023-10-25, 2023-10-26, 2023-10-27, 2023-10-28, 2023-10-29, 2023-10-29, 2023-10-30 
                                                                                                                                 1 
            2023-11-01, 2023-11-01, 2023-11-01, 2023-11-01, 2023-11-02, 2023-11-02, 2023-11-03, 2023-11-04, 2023-11-05, 2023-11-05 
                                                                                                                                 1 
                                                                                                2023-11-09, 2023-11-10, 2023-11-13 
                                                                                                                                 1 
> n_distinct(unique(as.factor(test$Dates[1])))
[1] 1

each row is recognized as one single chunk.

> unique(factor(str_split(test$Dates[1], ',')))
[1] c("2023-10-04", " 2023-10-05", " 2023-10-05", " 2023-10-06", " 2023-10-06", " 2023-10-06", " 2023-10-08", " 2023-10-08")
Levels: c("2023-10-04", " 2023-10-05", " 2023-10-05", " 2023-10-06", " 2023-10-06", " 2023-10-06", " 2023-10-08", " 2023-10-08")
> unique(str_split(test$Dates[1], ','))
[[1]]
[1] "2023-10-04"  " 2023-10-05" " 2023-10-05" " 2023-10-06" " 2023-10-06" " 2023-10-06" " 2023-10-08" " 2023-10-08"

> nlevels(factor(str_split(test$Dates[1], ',')))
[1] 1

Splitting the string does not give the distinct (unique) count either.


Solution

  • Example data:

    x <- c(
        "2023-10-04, 2023-10-05, 2023-10-05, 2023-10-06, 2023-10-06, 2023-10-06, 2023-10-08, 2023-10-08",
        "2023-10-11, 2023-10-12, 2023-10-12, 2023-10-14, 2023-10-15"
    )
    

    Count the unique dates in each string, e.g. like this:

    x |> strsplit(', ') |> sapply(\(x) length(unique(x)))
    

    Or using the tidyverse (stringr, purrr and dplyr loaded):

    x |> str_split(', ') |> map_int(n_distinct)
    

    Both give

    [1] 4 4
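
To get the count as a column next to Week, the same idea can be applied to the Dates column. The following is only a sketch in base R: the data frame is a hypothetical two-row stand-in for the asker's 'test' tibble, and the column name n_dates is made up for illustration. It splits on a comma followed by optional whitespace, so entries with or without the leading space are treated the same.

```r
# Hypothetical stand-in for the asker's `test` tibble (first two rows only).
test <- data.frame(
  Week  = c(2, 3),
  Dates = c(
    "2023-10-04, 2023-10-05, 2023-10-05, 2023-10-06, 2023-10-06, 2023-10-06, 2023-10-08, 2023-10-08",
    "2023-10-11, 2023-10-12, 2023-10-12, 2023-10-14, 2023-10-15"
  )
)

# strsplit() returns a list with one character vector per row.
# Passing that whole list to factor() or unique() works on the list
# itself rather than on the dates inside it, which is why the attempts
# in the question reported a count of 1.
parts <- strsplit(test$Dates, ",\\s*")

# Count the distinct dates inside each list element.
# `n_dates` is an illustrative column name, not one from the question.
test$n_dates <- vapply(parts, function(d) length(unique(d)), integer(1))

test[, c("Week", "n_dates")]
```

vapply() is used here instead of sapply() so that the result is guaranteed to be an integer vector with one value per row.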