r string substring

R: table frequencies of letters in string based on Alphabet


I need to compute letter frequencies for a large list of words. For each position in the word (first, second, ...), I need to count how many times each letter (a-z) appears across the list, and then tabulate the counts by word position.

For example, if my word list is:

words <- c("swims", "seems", "gills", "draws", "which", "water")

then the result table should look like this:

letter  first position  second position  third position  fourth position  fifth position
a       0               1                1               0                0
b       0               0                0               0                0
c       0               0                0               1                0
d       1               0                0               0                0
e       0               1                1               1                0
f       0               0                0               0                0
...     ...             ...              ...             ...              ...
(continued until z)

All words have the same length (5).

What I have so far is:

library(dplyr)  # needed for %>% and mutate()

alphabet <- letters[1:26]

words.df <- data.frame("Words" = words)

# Extract the letter at each position (note: the column is "Words", capital W)
words.df <- words.df %>% mutate("First_place" = substr(Words, 1, 1))
words.df <- words.df %>% mutate("Second_place" = substr(Words, 2, 2))
words.df <- words.df %>% mutate("Third_place" = substr(Words, 3, 3))
words.df <- words.df %>% mutate("Fourth_place" = substr(Words, 4, 4))
words.df <- words.df %>% mutate("Fifth_place" = substr(Words, 5, 5))

# Count letters per position; factor(x, alphabet) keeps all 26 letters as levels
x1 <- words.df$First_place
x1 <- table(factor(x1, alphabet))

x2 <- words.df$Second_place
x2 <- table(factor(x2, alphabet))

x3 <- words.df$Third_place
x3 <- table(factor(x3, alphabet))

x4 <- words.df$Fourth_place
x4 <- table(factor(x4, alphabet))

x5 <- words.df$Fifth_place
x5 <- table(factor(x5, alphabet))

My code is not efficient, and it gives a separate table for each letter position rather than one combined table. Any help would be appreciated.


Solution

  • In base R, use table:

    table(let = unlist(strsplit(words,'')),pos = sequence(nchar(words)))
    
       pos
    let 1 2 3 4 5
      a 0 1 1 0 0
      c 0 0 0 1 0
      d 1 0 0 0 0
      e 0 1 1 1 0
      g 1 0 0 0 0
      h 0 1 0 0 1
      i 0 1 2 0 0
      l 0 0 1 1 0
      m 0 0 0 2 0
      r 0 1 0 0 1
      s 2 0 0 0 4
      t 0 0 1 0 0
      w 2 1 0 1 0
    

    Note that if you need a row for every letter from a to z, including letters that never occur, convert the characters to a factor with levels = letters:

    table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))

    To get a data frame instead, you could do:

    d <- table(factor(unlist(strsplit(words,'')), letters), sequence(nchar(words)))
    cbind(letters = rownames(d), as.data.frame.matrix(d))
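
To make it clearer why the table() call above works, here is a minimal sketch that shows the two intermediate vectors and then wraps the same idea in a function. The helper name letter_position_counts and the tolower() step are my own assumptions for illustration, not part of the answer above.

```r
words <- c("swims", "seems", "gills", "draws", "which", "water")

# Split every word into single characters and flatten into one vector.
chars <- unlist(strsplit(words, ""))

# nchar(words) is 5 for every word here, so sequence() produces
# 1:5 once per word; element i of pos is the position of chars[i]
# inside its own word.
pos <- sequence(nchar(words))

# Inspect the first few (character, position) pairs.
head(data.frame(chars, pos), 7)

# Hypothetical helper (name is an assumption): one row per letter a-z,
# one column per position in the word.
letter_position_counts <- function(words) {
  # Assumption: upper-case input should be counted with lower-case letters.
  words <- tolower(words)
  counts <- table(
    letter   = factor(unlist(strsplit(words, "")), levels = letters),
    position = sequence(nchar(words))
  )
  # Convert the two-way table to a data frame, keeping its wide shape.
  as.data.frame.matrix(counts)
}

res <- letter_position_counts(words)
res["s", ]  # counts of "s" at each of the five positions
```

Because the letters are converted to a factor with levels = letters, any character outside a-z becomes NA and is dropped by table() by default. If the words had different lengths, the same code should still run, with columns extending to the length of the longest word.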