Search code examples
rdataframefrequencycontingency

select most frequent element in dataframe while using table


I have a list of data frames on which I want to use table. The list looks like this:

pronouns <- data.frame(pronounciation = c("juː","juː","juː","ju","ju","jə","jə","hɪm","hɪm","hɪm", "həm","ðɛm"), words = c("you","you","you","you","you","you","you","him","him","him","him","them"))
articles <- data.frame(pronounciation = c("ðiː","ði","ði","ðə","ðə","ði","ðə","eɪ","eɪ","æɪ","æɪ","eɪ","eɪ","eɪ","e"), words = c("the","the","the","the","the","the","the","a","a","a","a","a","a","a","a"))
numbers <- data.frame(pronounciation = c("wʌn","wʌn","wʌn","wʌn","wan","wa:n","tuː","tuː","tuː","tuː","tu","tu","tuː","tuː","θɹiː"), words = c("one","one","one","one","one","one","two","two","two","two","two","two","two","two","three"))
ls <- list(pronouns, articles, numbers)

ls[[1]]
   pronounciation words
1             juː   you
2             juː   you
3             juː   you
4              ju   you
5              ju   you
6              jə   you
7              jə   you
8             hɪm   him
9             hɪm   him
10            hɪm   him
11            həm   him
12            ðɛm  them

From this list of dataframes, I want to extract contingency tables for $words using table(), but also select the most common pronunciation of each word at the same time. The required result is in ls_out:

pronouns_out <- data.frame(pronounciation = c("juː","hɪm","ðɛm"), words = c("you","him","them"), occurence = c(7,4,1))
articles_out <- data.frame(pronounciation = c("ði","eɪ"), words = c("the","a"), occurence = c(7,8))
numbers_out <- data.frame(pronounciation = c("wʌn","tuː","θɹiː"), words = c("one","two","three"), occurence = c(6,8,1))
ls_out <- list(pronouns_out, articles_out, numbers_out)

ls_out[[1]]
  pronounciation words occurence
1            juː   you         7
2            hɪm   him         4
3            ðɛm  them         1

If the two or more pronunciations have the same frequency (like ði and ðə in ls[[2]]), a random selection of one pronunciation needs to be made.

Any advice on how to this is very welcome.


Solution

  • Using table (and lapply):

    ff = function(pronounce, word) 
    {
        tab = table(word, pronounce)
        data.frame(pronounciation = colnames(tab)[max.col(tab, "random")], 
                   words = rownames(tab),
                   occurences = unname(rowSums(tab)))
    }
    
    lapply(ls, function(x) ff(x$pronounciation, x$words))
    
    #[[1]]
    #     pronounciation words occurences
    #1        h<U+026A>m   him          4
    #2 <U+00F0><U+025B>m  them          1
    #3        ju<U+02D0>   you          7
    #
    #[[2]]
    #  pronounciation words occurences
    #1      e<U+026A>     a          8
    #2      <U+00F0>i   the          7
    #
    #[[3]]
    #      pronounciation words occurences
    #1         w<U+028C>n   one          6
    #2 θ<U+0279>i<U+02D0> three          1
    #3         tu<U+02D0>   two          8