Tags: r, dataframe, reshape2, conventions, long-format-data

Long format dataframe column hierarchy


I am working with long-format data frames and intend to write a package myself. While it is not crucial, I would like to follow common practice for column hierarchy. For example, with three participants and two sessions per participant, the column denoting the session would be seen as sitting higher in the hierarchy.

It seems to be standard that the value column goes at the far right. For the identifier columns, however, it is not clear whether the column highest in the hierarchy should be at the far left or directly next to the value column, i.e. the order could be session, participant, value or participant, session, value. The former is more intuitive to me, and ChatGPT also suggested it, but reshape2::melt() uses the latter order when converting a multidimensional array into a data frame: it puts the highest dimension to the right instead of to the left.

data.frame(session = rep(1:2, each = 3), participant = 1:3, value = sample(6)/100)
data.frame(participant = 1:3, session = rep(1:2, each = 3), value = sample(6)/100)
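The array-to-long ordering described above can be reproduced without extra packages. The following is a minimal sketch in base R, using as.data.frame() on a table rather than reshape2::melt(); the array contents and the dimension labels (p1, s1, ...) are made up for illustration. As far as I know, reshape2::melt(arr) orders the identifier columns the same way, but treat that as an assumption and check it on your own data.

```r
# Hypothetical 3 x 2 array: first dimension = participant, second = session.
arr <- array(
  1:6,
  dim = c(3, 2),
  dimnames = list(
    participant = c("p1", "p2", "p3"),
    session     = c("s1", "s2")
  )
)

# Convert to long format. The first dimension becomes the leftmost column
# and varies fastest; higher dimensions follow to the right.
long <- as.data.frame(as.table(arr), responseName = "value")

names(long)  # "participant" "session" "value"
long
```

So the participant, session, value layout falls out of the array's storage order (first index varies fastest), rather than from any convention about which identifier is "higher" in the hierarchy.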

Solution

  • Based on my understanding of the question, I haven't seen any clear rules of thumb or standards for how to organize interim tables in the middle of a data flow.

    Looking across various style guides, I can't find any reference to how to organize temporary columns during the transformation stage.

    The only consistency I see is, generally speaking, to have numeric columns on the right and factor/character data columns on the left. Typically, date columns go last, next to the numeric columns.

    When you have several character/factor columns, which hierarchy is more "logical" to arrange is really driven by domain and context.

    If you are grouping your data and then adding a column or summarizing, putting the grouping columns on the outside can help the user visualize the grouping actions.

    For pivot_longer() or pivot_wider() type actions, it will depend on which columns you are targeting, but again, keeping the id columns at the far left may help the user visualize the pivoting. In the pivot_wider() scenario, it also helps to keep the names_from columns as close as possible to the values_from columns.

    Hope this helps
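To illustrate the points about grouping columns and column order, here is a small sketch in base R. The data frame and its values are hypothetical, and aggregate() stands in for whatever grouping tool you actually use; it is one example of a function that puts grouping columns on the left and the summarized value on the right.

```r
# Hypothetical example data in the participant, session, value layout.
df <- data.frame(
  participant = rep(1:3, times = 2),
  session     = rep(1:2, each = 3),
  value       = c(0.04, 0.01, 0.06, 0.02, 0.05, 0.03)
)

# Column order is cheap to change: select the columns by name
# in whatever order suits the context.
df_session_first <- df[, c("session", "participant", "value")]

# aggregate() returns the grouping column first and the summary last.
session_means <- aggregate(value ~ session, data = df, FUN = mean)

names(df_session_first)  # "session" "participant" "value"
names(session_means)     # "session" "value"
```

Since reordering is a one-liner, a package can pick either identifier order as its default and let users rearrange as needed, as long as code inside the package refers to columns by name rather than by position.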