Number of different elements up to this point


I've got a relatively simple problem (I think) and I want to solve it in a fast and efficient way.

I want to count the number of different elements in a vector up to each point in this vector.

For example, in a vector like this

vec <- c("a", "b", "c", "a", "a", "c", "d", "a")

I want to get the following vector of equal size as a result: [1 2 3 3 3 3 4 4]

I could solve this of course with a for loop in combination with cumsum():

vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
res <- TRUE  # the first element is always new
for (i in seq_along(vec)[-1]) {
  # TRUE if vec[i] has not occurred earlier in the vector
  res[i] <- !(vec[i] %in% vec[1:(i-1)])
}
cumsum(res)
#[1] 1 2 3 3 3 3 4 4

However, I am dealing with vectors that have several million elements and a for-loop approach takes forever for such a relatively simple problem.

My intuition is that this should be solvable much faster and more cleverly. Do you have any ideas? Thank you!

(In case you're interested: I need this for a vocabulary growth curve analysis where we want to know at each point in the text how many different words, i.e. types, have been observed so far.)


Solution

  • Use cumsum() on the negated (!) result of duplicated(), so that only the first occurrence of each value is counted:

    cumsum(!duplicated(vec))
    #[1] 1 2 3 3 3 3 4 4
    

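    Another approach uses match(), which returns the position of the first occurrence of each unique value:

    uni <- logical(length(vec))            # all FALSE to start
    uni[match(unique(vec), vec)] <- TRUE   # mark first occurrences
    cumsum(uni)
    #[1] 1 2 3 3 3 3 4 4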
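
As a rough sketch of how this could feed the vocabulary growth curve mentioned in the question: the helper name `types_so_far` and the sample sentence below are made up for illustration, and splitting on single spaces is only an assumed stand-in for whatever tokenization the real analysis uses.

```r
# Hypothetical helper (name is an assumption): running count of distinct values.
types_so_far <- function(x) cumsum(!duplicated(x))

vec <- c("a", "b", "c", "a", "a", "c", "d", "a")
types_so_far(vec)
#[1] 1 2 3 3 3 3 4 4

# Made-up sample text; lowercased and split on single spaces (an assumption,
# real text would need proper tokenization).
text <- "the cat saw the dog and the dog saw the cat"
tokens <- strsplit(tolower(text), " ", fixed = TRUE)[[1]]
growth <- types_so_far(tokens)
growth
#[1] 1 2 3 3 4 5 5 5 5 5 5

# Sanity check: the match()-based approach gives the same counts.
uni <- logical(length(tokens))
uni[match(unique(tokens), tokens)] <- TRUE
stopifnot(identical(cumsum(uni), growth))
```

Plotting `growth` against `seq_along(tokens)` would then give the number of types observed after each token.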