Converting Summary Table of Binary Outcome to Long Tidy DataFrame

I want to convert a table that has several categorical variables, as well as summary of the result of a binary experiment to long format to easily run a logistic regression model.

Is there an easy way to do this that does not involve just making a bunch of vectors with rep() and then combining those into a dataframe? Ideally, I'd like one function that does this automatically, but maybe I'll just need to make my own.

For example, if I start with this summary table:

test   group    success  n 
A      control  1        2
A      treat    2        3
B      control  3        5
B      treat    1        3

I want to be able to switch it back to the following format:

test   group     success
A      control   1
A      control   0
A      treat     1
A      treat     1
A      treat     0
B      control   1
B      control   1
B      control   1
B      control   0
B      control   0
B      treat     1
B      treat     0
B      treat     0

Thanks!

Solution

The reshape package is your friend, here. In this case, melt() and untable() are useful for normalizing the data.

If the example summary data.frame is in a variable called df, an abbreviated answer is:

# replace total n with number of failures
df$fail = df$n - df$success
df$n = NULL

# melt and untable the data.frame
df = melt(df)
df = untable(df, df$value)

# recode the results, e.g., here by creating a new data.frame
df = data.frame(
  test = df$test, 
  group = df$group, 
  success = as.numeric(df$variable == "success")
)

This is a great example of a very general problem. The idea is to back calculate the list of data that underlies a cross-tabulation. Given the cross-tabulation, the back-calculated list of data has one row for each datum and contains the attributes of each datum. Here is a post to the inverse of this question.

In "data geek" parlance, this is a question of putting tabulated data in First Normal Form -- if that is helpful to anyone. You can google data normalization, which will help you design agile data.frames that can be cross-tabulated and analyzed in many different ways.

In detail, for melt() and untable() to work here, the raw data need to be tweaked a bit to include fail (number of failures) rather than total n, but that is simple enough:

df$fail <- df$n - df$success
df$n <- NULL

which gives:

  test   group success fail
1    A control       1    1
2    A   treat       2    1
3    B control       3    2
4    B   treat       1    2

Now we can "melt" the table. melt() can back-calculate the original list of data that was used to create a cross tabulation.

df <- melt(df)

In this case, we get new column called variable that contains either "success" or "fail", and a column called value that contains the datum from the original success or fail column.

  test   group variable value
1    A control  success     1
2    A   treat  success     2
3    B control  success     3
4    B   treat  success     1
5    A control     fail     1
6    A   treat     fail     1
7    B control     fail     2
8    B   treat     fail     2

The untable() function repeats each row of a table according to the value of a numeric "count" vector. In this case, df$value is the count vector, because it contains the number of successes and fails.

df <- untable(df, df$value)

which will yield one record for each datum, either a "success" or a "fail":

    test   group variable value
1      A control  success     1
2      A   treat  success     2
2.1    A   treat  success     2
3      B control  success     3
3.1    B control  success     3
3.2    B control  success     3
4      B   treat  success     1
5      A control     fail     1
6      A   treat     fail     1
7      B control     fail     2
7.1    B control     fail     2
8      B   treat     fail     2
8.1    B   treat     fail     2

This is the solution. If required, the data can now be recoded to replace "success" with 1 and "fail" with 0 (and get rid of the the extraneous value and variable columns...)

  df <- data.frame(
    test = df$test, 
    group = df$group, 
    success = as.numeric(df$variable == "success")
  )

This returns the requested solution, tho the rows are sorted differently:

   test   group success
1     A control       1
2     A   treat       1
3     A   treat       1
4     B control       1
5     B control       1
6     B control       1
7     B   treat       1
8     A control       0
9     A   treat       0
10    B control       0
11    B control       0
12    B   treat       0
13    B   treat       0

Obviously, the data.frame can be resorted, if necessary. How to sort a data.frame in R.