
Finding occurrences of characters across multiple vectors or a list


I wish to count the unique/distinct characters that occur across multiple vectors or across the elements of a list.

Perhaps it's best to describe this with an example:

In this example, let's say the "unique characters" are letters and the multiple "vectors" are books. I wish to find how the number of unique letters grows as the number of books increases.

# Initial data in the format of a list
book_list <- list(book_A = c("a", "b", "c", "z"),
                  book_B = c("c", "d", "a"),
                  book_C = c("b", "a", "c", "e", "x"))

# Initial data in the format of multiple vectors
book_A <- c("a", "b", "c", "z")
book_B <- c("c", "d", "a")
book_C <- c("b", "a", "c", "e", "x")

# Finding the unique letters in each book
# This is the part I'm struggling to write as a loop
one_book <- length(unique(book_A))
two_book <- length(unique(c(book_A, book_B)))
three_book <- length(unique(c(book_A, book_B, book_C)))

# Plot the desired output
plot(x=c(1,2,3), 
     y=c(one_book, two_book, three_book), 
     ylab = "Number of unique letters", xlab = "Book Number",
     main="The occurrence of unique letters as number of books increases")


Note: the real data set is much bigger. Each vector (book_A, book_B, etc.) has about 7000 elements.

I am attempting to solve the problem with dplyr or a data frame, but I'm not quite there yet.

# Explore the data frame option with example data
library(dplyr)
df <- read.delim("http://m.uploadedit.com/ba3s/148950223626.txt")

# Group them
df_group <- dplyr::group_by(df, book) %>% summarize(occurence = length(letter))

# Use the cumulative sum
plot(x=1:length(unique(df$book)), y=cumsum(df_group$occurence))

But I know the plot is not correct, as it only plots the cumulative sum of the letter counts per book rather than the cumulative number of unique letters that I intended. Any hints would be most helpful.

To add to the complexity, it would be nice if the books could be plotted in order of size, starting with the book that has the fewest letters. Something along the lines of:

# Example:
# Find the number of letters in each book
lapply(book_list, length)

# I know that book_B has the fewest letters (3),
# followed by book_A (4), then book_C (5)
one_book <- length(unique(book_B))
two_book <- length(unique(c(book_B, book_A)))
three_book <- length(unique(c(book_B, book_A, book_C)))


plot(x=c(1,2,3), 
     y=c(one_book, two_book, three_book), 
     ylab = "Number of letters", xlab = "Book Number")

Solution

  • You can use Reduce with accumulate = TRUE to build the running concatenation of the books, then count the unique letters at each step, i.e.

    sapply(Reduce(c, book_list, accumulate = TRUE), function(i) length(unique(i)))
    #[1] 4 5 7
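
For the "shortest book first" part of the question, one possible sketch is to reorder the list by element length before accumulating. This is only an illustration built on the toy data from the question; the names `ordered_books`, `cumulative_unique` and `single_pass` are made up here. The second variant, based on `duplicated()`, is an assumption about what might scale better for vectors of around 7000 elements, since it flags first occurrences in a single pass instead of calling `unique()` on an ever-growing vector.

```r
# Toy data from the question, as a named list
book_list <- list(book_A = c("a", "b", "c", "z"),
                  book_B = c("c", "d", "a"),
                  book_C = c("b", "a", "c", "e", "x"))

# Reorder the books so the one with the fewest letters comes first
ordered_books <- book_list[order(lengths(book_list))]

# Same Reduce/accumulate idea as above, applied to the reordered list
cumulative_unique <- sapply(Reduce(c, ordered_books, accumulate = TRUE),
                            function(i) length(unique(i)))
cumulative_unique
#[1] 3 5 7

# Alternative sketch: mark the first occurrence of every letter once,
# take the running total, and read it off at the last position of each book
all_letters <- unlist(ordered_books, use.names = FALSE)
single_pass <- cumsum(!duplicated(all_letters))[cumsum(lengths(ordered_books))]
single_pass
#[1] 3 5 7

# Plot the cumulative number of unique letters against the book count
plot(x = seq_along(cumulative_unique),
     y = cumulative_unique,
     type = "b",
     ylab = "Number of unique letters", xlab = "Book Number")
```

Both variants should agree on any input; dropping the `order()` step gives the original book order and reproduces the 4 5 7 result shown above.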