r aggregate summary r-factor

Syntax for referencing into results of summary frequency counts of categorical variable/factor


I am very stuck on a basic question about summarising categorical data. My raw data consists of multiple records of the form UserID, ItemID, CategoryID, GroupID. For each ItemID there is a fixed CategoryID. For each UserID there is a fixed GroupID. There can be an arbitrary number of entries for each UserID, but only one per ItemID. At the moment, when I read the data in from .csv, I set every column as a factor.

Here is a toy data set:

uIDs <- c("1", "1", "3", "8", "3", "8", "6")
iIDs <- c("a", "c", "d", "d", "e", "f", "g")
cIDs <- c("V", "V", "A", "A", "A", "A", "M")
gIDs <- c("U", "U", "N", "U", "N", "U", "P")
foo <- data.frame(uID = uIDs, iID = iIDs, cID = cIDs, gID = gIDs)

From this data set I need to extract, in usable form, various summaries, such as:

  • for each uID, how many iIDs are there?
  • for each uID, how many cIDs are there?
  • for each iID, how many uIDs are there?
  • for each cID, how many uIDs are there?
  • for each cID, how many gIDs are there?
  • for each gID, how many cIDs are there?

Very straightforward stuff, but I have spent most of the day struggling with it. I am particularly confused by the different forms of output returned by the functions that can help with this (aggregate, summary, by, table, and friends). Take summary as an example. Its output looks really useful, but I can't figure out how to get at it.

    > summary(foo)
 uID    iID   cID   gID  
  8:1   a:1   A:4   N:2  
 1 :2   c:1   M:1   P:1  
 3 :2   d:2   V:2   U:4  
 6 :1   e:1              
 8 :1   f:1              
        g:1

When I inspect the result with str(), it turns out to be a character table, and I don't know how to strip it down to get at what I want.

    > str(summary(foo))
 'table' chr [1:6, 1:4] " 8:1  " "1 :2  " "3 :2  " "6 :1  " ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:6] "" "" "" "" ...
  ..$ : chr [1:4] "uID" "iID" "cID" "gID"

Given my needs, which are simple, what is the most straightforward way of asking my question so that I can get a result I can easily manipulate further?

thanks!

p.s. Sorry if the pasted code isn't in the right format. I am pasting from RStudio and it doesn't look right. Advice welcome (I searched for guidance and didn't find anything, but I know it's there somewhere as I read it about 6 months ago...)


Solution

  • You can answer most of those questions like so:

    • for each uID, how many iIDs are there?

    with(foo, rowSums(table(uID, iID)))

    1 3 6 8 
    2 2 1 2 
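The other questions follow the same pattern. One caveat: rowSums(table(...)) counts rows, so if you want the number of distinct values per group (for example, distinct cIDs per uID) you can test the table for presence first. A minimal sketch, assuming the toy data with the uID values exactly as written above; count_distinct is a hypothetical helper name, not a base R function:

```r
# Toy data from the question, uID values taken as written (no stray spaces).
foo <- data.frame(
  uID = c("1", "1", "3", "8", "3", "8", "6"),
  iID = c("a", "c", "d", "d", "e", "f", "g"),
  cID = c("V", "V", "A", "A", "A", "A", "M"),
  gID = c("U", "U", "N", "U", "N", "U", "P")
)

# Hypothetical helper: for each value of column `by`, count how many
# distinct values of column `what` occur with it.
# table() cross-tabulates the two columns; `> 0` turns the counts into
# presence flags, and rowSums() adds up the flags per row.
count_distinct <- function(df, by, what) {
  rowSums(table(df[[by]], df[[what]]) > 0)
}

cid_per_uid <- count_distinct(foo, "uID", "cID")  # distinct cIDs for each uID
uid_per_iid <- count_distinct(foo, "iID", "uID")  # distinct uIDs for each iID
uid_per_cid <- count_distinct(foo, "cID", "uID")  # distinct uIDs for each cID
gid_per_cid <- count_distinct(foo, "cID", "gID")  # distinct gIDs for each cID
cid_per_gid <- count_distinct(foo, "gID", "cID")  # distinct cIDs for each gID

# Each result is a plain named vector, so it can be indexed by name.
uid_per_cid[["A"]]
```

Because every result is a named numeric vector, you can sort it, subset it, or pass it to other functions without any further unpacking.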
    

    NB I think there is a slight error in your example data: one of your uID values is " 8" rather than "8", which confused me for a bit.
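
If the stray space comes from the raw file, one option is to trim whitespace before counting. A rough sketch, assuming the " 8" sits in the fourth row (the position is a guess made for illustration) and using only base R (trimws, table, as.data.frame):

```r
# Toy data with one uID carrying a stray leading space.
# Which row holds " 8" is an assumption made for illustration.
foo <- data.frame(
  uID = c("1", "1", "3", " 8", "3", "8", "6"),
  iID = c("a", "c", "d", "d", "e", "f", "g"),
  cID = c("V", "V", "A", "A", "A", "A", "M"),
  gID = c("U", "U", "N", "U", "N", "U", "P")
)

# Number of distinct uID values before cleaning (" 8" and "8" differ).
n_before <- length(unique(foo$uID))

# Trim leading/trailing whitespace in every column, then store as factors.
foo[] <- lapply(foo, function(col) factor(trimws(as.character(col))))

# Frequency counts for a single column: a 1-d table, indexable by name.
counts <- table(foo$uID)
n_eight <- counts[["8"]]

# The same counts as an ordinary data frame, convenient for merging or plotting.
uid_counts <- as.data.frame(counts, responseName = "n", stringsAsFactors = FALSE)
names(uid_counts)[1] <- "uID"
```

Here table() on a single column gives the per-level counts that summary() prints, but in a form you can index by name or convert to a data frame.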