I am trying to determine the number of verbs (n) in each tense and aspect (TA) for each text (filename) in my dataset. I have my data saved as a tibble (see below), but the values in the n column are not yet accurate because some categories subsume others. For instance, to get an accurate count of past_simple, I need to subtract the number (n) for past_perfect and present_progressive in that same text. Essentially, I am looking to do something like this:
Subtract the value (n) of past_perfect and past_progressive from the value of past_simple for filename BIO.GO.01.1
Repeat this process for each individual file
tib <- structure(
list(
TA = c(
"past_perfect",
"past_progressive",
"past_simple",
"past_simple",
"past_simple",
"past_simple",
"past_simple",
"past_simple",
"past_perfect",
"past_progressive"
),
tense = c(
"past",
"past",
"past",
"past",
"past",
"past",
"past",
"past",
"past",
"past"
),
aspect = c(
"perfect",
"progressive",
"simple",
"simple",
"simple",
"simple",
"simple",
"simple",
"perfect",
"progressive"
),
filename = c(
"BIO.G0.01.1",
"BIO.G0.01.1",
"BIO.G0.01.1",
"BIO.G0.02.1",
"BIO.G0.02.2",
"BIO.G0.02.4",
"BIO.G0.02.5",
"BIO.G0.02.6",
"BIO.G0.03.1",
"BIO.G0.03.1"
),
discipline = c(
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO",
"BIO"
),
nativeness = c("NS", "NS", "NS",
"NS", "NS", "NS", "NS", "NS", "NS", "NS"),
year = c("G0",
"G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0", "G0"),
gender = c("F",
"F", "F", "M", "M", "M", "M", "M", "F", "F"),
n = c(2L, 2L,
57L, 39L, 3L, 4L, 49L, 103L, 1L, 1L)
),
class = c("grouped_df",
"tbl_df", "tbl", "data.frame"),
row.names = c(NA,-10L),
groups = structure(
list(
filename = c(
"BIO.G0.01.1",
"BIO.G0.02.1",
"BIO.G0.02.2",
"BIO.G0.02.4",
"BIO.G0.02.5",
"BIO.G0.02.6",
"BIO.G0.03.1"
),
discipline = c("BIO", "BIO", "BIO", "BIO", "BIO", "BIO",
"BIO"),
nativeness = c("NS", "NS", "NS", "NS", "NS", "NS",
"NS"),
year = c("G0", "G0", "G0", "G0", "G0", "G0", "G0"),
gender = c("F", "M", "M", "M", "M", "M", "F"),
.rows = structure(
list(1:3, 4L, 5L, 6L, 7L, 8L, 9:10),
ptype = integer(0),
class = c("vctrs_list_of",
"vctrs_vctr", "list")
)
),
class = c("tbl_df", "tbl", "data.frame"),
row.names = c(NA,-7L),
.drop = TRUE
)
)
I know how to subtract rows from other rows based on their position, like this:
tib[3, 9] <- tib[3, 9] - tib[2, 9] - tib[1, 9]
But the rows do not always appear in this predictable order because not all TA options are present in each text (filename). I'm also not sure how to write the code to restart this process again each time it comes across a new filename.
I am still learning how to manipulate data in R. Any suggestions would be very much appreciated!
Based on the logic showed, we may need a group by operation. It may be better to reshape to wide before doing a join if there are some elements in 'TA' missing
library(dplyr)
library(tidyr)
tib %>%
ungroup %>%
select(-tense, -aspect) %>%
pivot_wider(names_from = TA, values_from = n, values_fill = 0) %>%
mutate(n1 = past_simple - past_progressive - past_perfect,
TA = 'past_simple', .keep = 'unused') %>%
left_join(tib %>% ungroup, .) %>%
mutate(n = coalesce(n1, n), .keep = 'unused')
-output
# A tibble: 10 × 9
TA tense aspect filename discipline nativeness year gender n
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 past_perfect past perfect BIO.G0.01.1 BIO NS G0 F 2
2 past_progressive past progressive BIO.G0.01.1 BIO NS G0 F 2
3 past_simple past simple BIO.G0.01.1 BIO NS G0 F 53
4 past_simple past simple BIO.G0.02.1 BIO NS G0 M 39
5 past_simple past simple BIO.G0.02.2 BIO NS G0 M 3
6 past_simple past simple BIO.G0.02.4 BIO NS G0 M 4
7 past_simple past simple BIO.G0.02.5 BIO NS G0 M 49
8 past_simple past simple BIO.G0.02.6 BIO NS G0 M 103
9 past_perfect past perfect BIO.G0.03.1 BIO NS G0 F 1
10 past_progressive past progressive BIO.G0.03.1 BIO NS G0 F 1