Search code examples
rsubsetcalculated-columnstibblemutate

How to subtract the value of some rows from other rows in R according to categorical variables


I am trying to determine the number of verbs (n) in each tense and aspect (TA) for each text (filename) in my dataset. I have my data saved as a tibble (see below), but the values in the n column are not yet accurate because some categories subsume others. For instance, to get an accurate count of past_simple, I need to subtract the number (n) for past_perfect and present_progressive in that same text. Essentially, I am looking to do something like this:

  1. Subtract the value (n) of past_perfect and past_progressive from the value of past_simple for filename BIO.GO.01.1

  2. Repeat this process for each individual file

    tib <- structure(
  list(
    TA = c(
      "past_perfect",
      "past_progressive",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_simple",
      "past_perfect",
      "past_progressive"
    ),
    tense = c(
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past",
      "past"
    ),
    aspect = c(
      "perfect",
      "progressive",
      "simple",
      "simple",
      "simple",
      "simple",
      "simple",
      "simple",
      "perfect",
      "progressive"
    ),
    filename = c(
      "BIO.G0.01.1",
      "BIO.G0.01.1",
      "BIO.G0.01.1",
      "BIO.G0.02.1",
      "BIO.G0.02.2",
      "BIO.G0.02.4",
      "BIO.G0.02.5",
      "BIO.G0.02.6",
      "BIO.G0.03.1",
      "BIO.G0.03.1"
    ),
    discipline = c(
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO",
      "BIO"
    ),
    nativeness = c("NS", "NS", "NS",
                   "NS", "NS", "NS", "NS", "NS", "NS", "NS"),
    year = c("G0",
             "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0"),
    gender = c("F",
               "F", "F", "M", "M", "M", "M", "M", "F", "F"),
    n = c(2L, 2L,
          57L, 39L, 3L, 4L, 49L, 103L, 1L, 1L)
  ),
  class = c("grouped_df",
            "tbl_df", "tbl", "data.frame"),
  row.names = c(NA,-10L),
  groups = structure(
    list(
      filename = c(
        "BIO.G0.01.1",
        "BIO.G0.02.1",
        "BIO.G0.02.2",
        "BIO.G0.02.4",
        "BIO.G0.02.5",
        "BIO.G0.02.6",
        "BIO.G0.03.1"
      ),
      discipline = c("BIO", "BIO", "BIO", "BIO", "BIO", "BIO",
                     "BIO"),
      nativeness = c("NS", "NS", "NS", "NS", "NS", "NS",
                     "NS"),
      year = c("G0", "G0", "G0", "G0", "G0", "G0", "G0"),
      gender = c("F", "M", "M", "M", "M", "M", "F"),
      .rows = structure(
        list(1:3, 4L, 5L, 6L, 7L, 8L, 9:10),
        ptype = integer(0),
        class = c("vctrs_list_of",
                  "vctrs_vctr", "list")
      )
    ),
    class = c("tbl_df", "tbl", "data.frame"),
    row.names = c(NA,-7L),
    .drop = TRUE
  )
)

I know how to subtract rows from other rows based on their position, like this:

tib[3, 9] <- tib[3, 9] - tib[2, 9] - tib[1, 9]

But the rows do not always appear in this predictable order because not all TA options are present in each text (filename). I'm also not sure how to write the code to restart this process again each time it comes across a new filename.

I am still learning how to manipulate data in R. Any suggestions would be very much appreciated!


Solution

  • Based on the logic showed, we may need a group by operation. It may be better to reshape to wide before doing a join if there are some elements in 'TA' missing

    library(dplyr)
    library(tidyr)
    tib %>%
       ungroup %>% 
       select(-tense, -aspect) %>%
       pivot_wider(names_from = TA, values_from = n, values_fill = 0) %>% 
       mutate(n1 = past_simple - past_progressive - past_perfect,  
           TA = 'past_simple', .keep = 'unused') %>% 
       left_join(tib %>% ungroup, .) %>% 
       mutate(n = coalesce(n1, n), .keep = 'unused')
    

    -output

    # A tibble: 10 × 9
       TA               tense aspect      filename    discipline nativeness year  gender     n
       <chr>            <chr> <chr>       <chr>       <chr>      <chr>      <chr> <chr>  <int>
     1 past_perfect     past  perfect     BIO.G0.01.1 BIO        NS         G0    F          2
     2 past_progressive past  progressive BIO.G0.01.1 BIO        NS         G0    F          2
     3 past_simple      past  simple      BIO.G0.01.1 BIO        NS         G0    F         53
     4 past_simple      past  simple      BIO.G0.02.1 BIO        NS         G0    M         39
     5 past_simple      past  simple      BIO.G0.02.2 BIO        NS         G0    M          3
     6 past_simple      past  simple      BIO.G0.02.4 BIO        NS         G0    M          4
     7 past_simple      past  simple      BIO.G0.02.5 BIO        NS         G0    M         49
     8 past_simple      past  simple      BIO.G0.02.6 BIO        NS         G0    M        103
     9 past_perfect     past  perfect     BIO.G0.03.1 BIO        NS         G0    F          1
    10 past_progressive past  progressive BIO.G0.03.1 BIO        NS         G0    F          1