Search code examples
rreshapedataformat

Long and wide data – when to use what?


I'm in the process of compiling data from different data sets into one data set for analysis. I'll be doing data exploration, trying different things to find out what regularities may be hidden in the data, so I don't currently have a specific method in mind. Now I'm wondering if I should compile my data into long or wide format.

Which format should I use, and why?

I understand that data can be reshaped from long to wide or vice versa, but the mere existence of this functionality implies that the need to reshape sometimes arises and this need in turn implies that a specific format might be better suited for a certain task. So when do I need which format, and why?

I'm not asking about performance. That has been covered in other questions.


Solution

  • Hadley Wickham's Tidy Data paper, and the tidyr package that is his (latest) implementation of its principles, is a great place to start.

    The rough answer to the question is that data, during processing, should always be long, and should only be widened for display purposes. Be cautious with this, though, as here "long" refers more to "tidy", rather than the pure long form.

    Examples

    Take, for example, the mtcars dataset. This is already in tidy form, in that each row represents a single observation. So "lengthening" it, to get something like this

            model type   value
    1 AMC Javelin  mpg  15.200
    2 AMC Javelin  cyl   8.000
    3 AMC Javelin disp 304.000
    4 AMC Javelin   hp 150.000
    5 AMC Javelin drat   3.150
    6 AMC Javelin   wt   3.435
    

    is counterproductive; mpg and cyl are not comparable in any meaningful way.

    Taking the ChickWeight dataset (which is in long form) and transforming it to wide by time

    require(tidyr)
    ChickWeight %>% spread(Time, weight)
       Chick Diet  0  2  4  6   8  10  12  14  16  18  20  21
    1     18    1 39 35 NA NA  NA  NA  NA  NA  NA  NA  NA  NA
    2     16    1 41 45 49 51  57  51  54  NA  NA  NA  NA  NA
    3     15    1 41 49 56 64  68  68  67  68  NA  NA  NA  NA
    4     13    1 41 48 53 60  65  67  71  70  71  81  91  96
    5      9    1 42 51 59 68  85  96  90  92  93 100 100  98
    6     20    1 41 47 54 58  65  73  77  89  98 107 115 117
    7     10    1 41 44 52 63  74  81  89  96 101 112 120 124
    8      8    1 42 50 61 71  84  93 110 116 126 134 125  NA
    9     17    1 42 51 61 72  83  89  98 103 113 123 133 142
    10    19    1 43 48 55 62  65  71  82  88 106 120 144 157
    11     4    1 42 49 56 67  74  87 102 108 136 154 160 157
    12     6    1 41 49 59 74  97 124 141 148 155 160 160 157
    13    11    1 43 51 63 84 112 139 168 177 182 184 181 175
    ...
    

    gives a visualization that may be useful, but for data analysis purposes, is very inconvenient, as computing things like growth rate become cumbersome.