
Formal UML representation of reshaping a data frame


To document the restructuring of a data table from "wide" (one criterion column per score) to "long" (one score column and one criterion column), my first reaction was to use a UML class diagram.

[Figure: wide and long versions of the same class / data table]

I am aware that by changing the structure of the data table, the class attributes have not changed.

My first question is whether the wide or the long version is the more correct representation of the data table?

My second question is whether it would make sense to relate the two representations - and if so, by which relationship?

My third question is whether something other than a UML class diagram would be more suitable for documenting the reshaping (data preprocessing before showing the distribution as a box plot in R).


Solution

  • You jumped a little too fast from the table to the UML. This makes your question confusing, because what is wide as a table is represented long as a class, and vice versa.

    Reformulating your problem, it appears that you are refactoring some tables. The wide table shows several values for the same student in the same row. This means that the maximum number of exercises is fixed by the table structure:

    ID    Ex1  Ex2  Ex3 .... Ex N 
    -----------------------------
    111    A    A   A   ...   A
    119    A    C   -   ...   D
    127    B    F   B   ...   F
    

    The long table has fewer columns, and each row shows only 1 specific score of 1 specific student:

    ID   #    Score 
    ---------------
    111  1     A 
    111  2     A
    111  3     A   
              ...
    111  N     A
    119  1     A
    119  2     C
              ...
    

    You can model this structure in a UML class diagram. But in UML, the table layout doesn't matter: that's a matter of object-relational mapping (ORM), and you could perfectly well have one class model (with an attribute or an association having a multiplicity of 1..N) that could be implemented using either the wide or the long version. If the multiplicity were 1..*, only the long option would work.

    Now to your questions:

    1. Both representations are correct; they just have different characteristics. The wide one is inflexible, since the maximum number of scores is fixed by the table structure. Also, adding a new score actually requires updating an existing record rather than inserting a new one (so the two models do not allow the same concurrency). The long one is a little more complex to use if you want to show the history of a student's scores in a single row.
    2. Yes, it makes sense to relate both, especially if you're describing the transformation of the first into the second.
    3. UML would not necessarily add value here. If you're really just dealing with tables and values, you could just as well use an entity-relationship diagram. But UML has the advantage of allowing database modelling as well, and it lets you add behavioral aspects, if not now, then later. You could consider using the non-standard «table» stereotype to clarify that you are modelling a table (so a low-level view of your design).
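
To make the point about one class model and two table layouts more concrete, here is a minimal Python sketch (standard library only). It is an illustration under assumptions, not the asker's actual data or code: the names `Student`, `to_wide`, `to_long` and `wide_to_long` are hypothetical, the sample uses a three-exercise excerpt of the tables above, and a missing score (`-` in the wide table) is represented as `None`.

```python
from dataclasses import dataclass, field


@dataclass
class Student:
    """One conceptual class: a student with an ordered list of scores."""
    student_id: int
    scores: list = field(default_factory=list)  # None marks a missing score


def to_wide(students, n_exercises):
    """Wide layout: one row per student, one column per exercise (fixed N)."""
    header = ["ID"] + [f"Ex{i}" for i in range(1, n_exercises + 1)]
    rows = []
    for s in students:
        if len(s.scores) > n_exercises:
            # The table structure caps the number of scores per student.
            raise ValueError("wide layout cannot hold more than N scores")
        padded = s.scores + [None] * (n_exercises - len(s.scores))
        rows.append([s.student_id] + padded)
    return header, rows


def to_long(students):
    """Long layout: one row per (student, exercise number, score)."""
    header = ["ID", "#", "Score"]
    rows = [
        [s.student_id, number, score]
        for s in students
        for number, score in enumerate(s.scores, start=1)
        if score is not None
    ]
    return header, rows


def wide_to_long(header, rows):
    """Reshape wide rows into long rows, skipping missing scores."""
    long_rows = []
    for row in rows:
        student_id, values = row[0], row[1:]
        for number, score in enumerate(values, start=1):
            if score is not None:
                long_rows.append([student_id, number, score])
    return ["ID", "#", "Score"], long_rows


# Hypothetical sample data, modelled on the tables above.
students = [
    Student(111, ["A", "A", "A"]),
    Student(119, ["A", "C", None]),
    Student(127, ["B", "F", "B"]),
]

wide_header, wide_rows = to_wide(students, n_exercises=3)
long_header, long_rows = to_long(students)
```

Both layouts are derived from the same `Student` objects, and `wide_to_long(wide_header, wide_rows)` yields the same rows as `to_long(students)`. The `ValueError` in `to_wide` corresponds to the fixed upper bound N; the long layout has no such limit, which is why it is the only option for a 1..* multiplicity.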