Search code examples
rdata-manipulationdata-munging

Converting Summary Table of Binary Outcome to Long Tidy DataFrame


I want to convert a table that has several categorical variables, as well as summary of the result of a binary experiment to long format to easily run a logistic regression model.

Is there an easy way to do this that does not involve just making a bunch of vectors with rep() and then combining those into a dataframe? Ideally, I'd like one function that does this automatically, but maybe I'll just need to make my own.

For example, if I start with this summary table:

test   group    success  n 
A      control  1        2
A      treat    2        3
B      control  3        5
B      treat    1        3

I want to be able to switch it back to the following format:

test   group     success
A      control   1
A      control   0
A      treat     1
A      treat     1
A      treat     0
B      control   1
B      control   1
B      control   1
B      control   0
B      control   0
B      treat     1
B      treat     0
B      treat     0

Thanks!


Solution

  • The reshape package is your friend, here. In this case, melt() and untable() are useful for normalizing the data.

    If the example summary data.frame is in a variable called df, an abbreviated answer is:

    # replace total n with number of failures
    df$fail = df$n - df$success
    df$n = NULL
    
    # melt and untable the data.frame
    df = melt(df)
    df = untable(df, df$value)
    
    # recode the results, e.g., here by creating a new data.frame
    df = data.frame(
      test = df$test, 
      group = df$group, 
      success = as.numeric(df$variable == "success")
    )
    

    This is a great example of a very general problem. The idea is to back calculate the list of data that underlies a cross-tabulation. Given the cross-tabulation, the back-calculated list of data has one row for each datum and contains the attributes of each datum. Here is a post to the inverse of this question.

    In "data geek" parlance, this is a question of putting tabulated data in First Normal Form -- if that is helpful to anyone. You can google data normalization, which will help you design agile data.frames that can be cross-tabulated and analyzed in many different ways.

    In detail, for melt() and untable() to work here, the raw data need to be tweaked a bit to include fail (number of failures) rather than total n, but that is simple enough:

    df$fail <- df$n - df$success
    df$n <- NULL
    

    which gives:

      test   group success fail
    1    A control       1    1
    2    A   treat       2    1
    3    B control       3    2
    4    B   treat       1    2
    

    Now we can "melt" the table. melt() can back-calculate the original list of data that was used to create a cross tabulation.

    df <- melt(df)
    

    In this case, we get new column called variable that contains either "success" or "fail", and a column called value that contains the datum from the original success or fail column.

      test   group variable value
    1    A control  success     1
    2    A   treat  success     2
    3    B control  success     3
    4    B   treat  success     1
    5    A control     fail     1
    6    A   treat     fail     1
    7    B control     fail     2
    8    B   treat     fail     2
    

    The untable() function repeats each row of a table according to the value of a numeric "count" vector. In this case, df$value is the count vector, because it contains the number of successes and fails.

    df <- untable(df, df$value)
    

    which will yield one record for each datum, either a "success" or a "fail":

        test   group variable value
    1      A control  success     1
    2      A   treat  success     2
    2.1    A   treat  success     2
    3      B control  success     3
    3.1    B control  success     3
    3.2    B control  success     3
    4      B   treat  success     1
    5      A control     fail     1
    6      A   treat     fail     1
    7      B control     fail     2
    7.1    B control     fail     2
    8      B   treat     fail     2
    8.1    B   treat     fail     2
    

    This is the solution. If required, the data can now be recoded to replace "success" with 1 and "fail" with 0 (and get rid of the the extraneous value and variable columns...)

      df <- data.frame(
        test = df$test, 
        group = df$group, 
        success = as.numeric(df$variable == "success")
      )
    

    This returns the requested solution, tho the rows are sorted differently:

       test   group success
    1     A control       1
    2     A   treat       1
    3     A   treat       1
    4     B control       1
    5     B control       1
    6     B control       1
    7     B   treat       1
    8     A control       0
    9     A   treat       0
    10    B control       0
    11    B control       0
    12    B   treat       0
    13    B   treat       0
    

    Obviously, the data.frame can be resorted, if necessary. How to sort a data.frame in R.