Tags: r, dplyr, tidyverse, forcats

How can I filter rows where there are more than 3 observations?


I have a simple dataset, and I'm trying to find cities with more than 3 observations (n). However, I'm encountering an error when using the fct_lump() function. Could you help me identify the issue?

tablo1 |> 
  count(sehir, sort = TRUE) 
   sehir              n
   <chr>          <int>
 1 Adana              2
 2 Adıyaman           1
 3 Afyonkarahisar     2
 4 Aksaray            1
 5 Amasya             1
 6 Ankara            23
 7 Antalya            5
 8 Ardahan            1
 9 Artvin             1
10 Aydın              1
# ℹ 71 more rows
# ℹ Use `print(n = ...)` to see more rows

Here's the current code that results in an error:

tablo1 |> 
  count(sehir) |>
  filter(fct_lump(sehir, 5, w = n))  

The error message I'm receiving is:

Error in `filter()`:
ℹ In argument: `fct_lump(sehir, 5, w = n)`.
Caused by error:
! `..1` must be a logical vector, not a <factor> object.
Run `rlang::last_trace()` to see where the error occurred. 

What am I doing wrong?

rlang::last_trace()
<error/rlang_error>
Error in `filter()`:
ℹ In argument: `fct_lump(sehir, 5, w = n)`.
Caused by error:
! `..1` must be a logical vector, not a <factor> object.
---
Backtrace:
    ▆
 1. ├─dplyr::filter(count(tablo1, sehir), fct_lump(sehir, 5, w = n))
 2. ├─dplyr:::filter.data.frame(count(tablo1, sehir), fct_lump(sehir, 5, w = n))
 3. │ └─dplyr:::filter_rows(.data, dots, by)
 4. │   └─dplyr:::filter_eval(...)
 5. │     ├─base::withCallingHandlers(...)
 6. │     └─mask$eval_all_filter(dots, env_filter)
 7. │       └─dplyr (local) eval()
 8. └─dplyr:::dplyr_internal_error(...)
Run rlang::last_trace(drop = FALSE) to see 5 hidden frames. 

Solution

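  • filter() expects a logical vector, but fct_lump() returns a factor, hence the error. The fct_lump_*() functions recode factor levels rather than select rows, and they are easiest to use on the uncounted values. With fct_lump_min(sehir, min = 4) you keep the levels with "more than 3 observations" and lump everything else into Other, which you can then count: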

    library(dplyr, warn.conflicts = FALSE)
    library(forcats)
    
    # uncount first to get "original" dataset
    tablo1 <- read.table(header = TRUE, text="
    sehir              n
    1 Adana              2
    2 Adıyaman           1
    3 Afyonkarahisar     2
    4 Aksaray            1
    5 Amasya             1
    6 Ankara            23
    7 Antalya            5
    8 Ardahan            1
    9 Artvin             1
    10 Aydın              1") |>
      tidyr::uncount(n) |>
      as_tibble()
    glimpse(tablo1)
    #> Rows: 38
    #> Columns: 1
    #> $ sehir <chr> "Adana", "Adana", "Adıyaman", "Afyonkarahisar", "Afyonkarahisar"…
    
    # lump cities with fewer than 4 observations into "Other", then count
    tablo1 |>
      mutate(sehir = fct_lump_min(sehir, min = 4)) |>
      count(sehir)
    #> # A tibble: 3 × 2
    #>   sehir       n
    #>   <fct>   <int>
    #> 1 Ankara     23
    #> 2 Antalya     5
    #> 3 Other      10
    

    Created on 2024-02-01 with reprex v2.0.2
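
If all you need is the rows with n > 3, a plain comparison inside filter() may be enough. The sketch below is not part of the original answer: it rebuilds the ten counted rows shown in the question under the hypothetical name `counts`, and it assumes a forcats version that provides fct_lump_min(). The second half shows one way to stay with the counted data and pass the counts as weights via w = n, so no uncounting is needed.

```r
library(dplyr, warn.conflicts = FALSE)
library(forcats)

# Hypothetical stand-in for the output of `tablo1 |> count(sehir)`,
# limited to the ten rows visible in the question
counts <- tibble(
  sehir = c("Adana", "Adıyaman", "Afyonkarahisar", "Aksaray", "Amasya",
            "Ankara", "Antalya", "Ardahan", "Artvin", "Aydın"),
  n     = c(2L, 1L, 2L, 1L, 1L, 23L, 5L, 1L, 1L, 1L)
)

# filter() needs a logical condition, so compare the count directly
frequent <- counts |>
  filter(n > 3)

# Alternative: keep the counted data and use the counts as weights.
# Cities whose weighted total is below 4 are lumped into "Other";
# count(wt = n) then sums the weights per remaining level.
lumped <- counts |>
  mutate(sehir = fct_lump_min(sehir, min = 4, w = n)) |>
  count(sehir, wt = n)

frequent
lumped
```

On these ten rows, `frequent` holds only Ankara and Antalya, while `lumped` adds an Other row that sums the remaining cities.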