algorithm, data-structures, language-agnostic, dataset

Data Set Manipulation


I need to reverse engineer a data set to its original form. The original data set was derived from a process where multiple users, each with multiple characteristics, enter a room and some click on a button. The column variables are in indicator form: where a user clicks the button or has a certain characteristic, this is recorded as a one, and where they don't, it is recorded as a zero. This data set is then transformed into a form where the characteristic types are observations represented by two characteristic variables. The new data set shows, for each pair of characteristics, the number of users who have both and their button clicks. It also encompasses all users. My explanation might not be the clearest, so here is an image that might help:

enter image description here

I'm thinking of using some type of lookup algorithm to solve this, but that might not be very efficient.


Solution

  • Unfortunately, in general, you will not be able to unambiguously reverse engineer your data set. Ignoring for the moment the action column, consider the following two data sets:

    Data set 1:

    A B C
    1 1 1
    1 1 0
    0 1 1
    1 0 1
    1 0 0
    1 0 0
    0 1 0
    0 1 0
    0 0 1
    0 0 1
    

    Data set 2:

    A B C
    1 1 0
    1 1 0
    1 0 1
    1 0 1
    0 1 1
    0 1 1
    1 0 0
    0 1 0
    0 0 1
    

    Unless I'm mistaken, these two data sets would show the same number of users under each pair of characteristics:

    A A 5
    A B 2
    A C 2
    B B 5
    B C 2
    C C 5
    

    Now, you might be tempted to think: "Hey, the first data set has 10 users but the second data set has only 9. If I'm able to get the total number of users, will this solve my problem?" The answer is mostly no. If you have three or fewer characteristics, then the answer is yes: the total, together with the single and pairwise counts, determines how many users have all three characteristics (see: Inclusion-exclusion Principle), provided every user has at least one characteristic. However, if you have more than three characteristics, the answer is no. You can construct similarly ambiguous examples where the total number of users is the same.
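To check the claim above, here is a minimal Python sketch that recomputes the pairwise counts for both data sets and then applies inclusion-exclusion in the three-characteristic case. The helper names `pair_counts` and `triple_count` are made up for illustration, and `triple_count` assumes every user has at least one characteristic (true for both example data sets, but not necessarily for real data).

```python
from itertools import combinations_with_replacement

NAMES = "ABC"

def pair_counts(rows, names=NAMES):
    """For each pair (X, Y), count the users who have both X and Y.
    The pair (X, X) is simply the number of users who have X."""
    counts = {}
    for i, j in combinations_with_replacement(range(len(names)), 2):
        counts[(names[i], names[j])] = sum(1 for r in rows if r[i] and r[j])
    return counts

def triple_count(total, counts):
    """Inclusion-exclusion for exactly three characteristics.
    Assumes every user has at least one characteristic, so that
    total = |A| + |B| + |C| - |AB| - |AC| - |BC| + |ABC|."""
    singles = counts[("A", "A")] + counts[("B", "B")] + counts[("C", "C")]
    pairs = counts[("A", "B")] + counts[("A", "C")] + counts[("B", "C")]
    return total - singles + pairs

# Rows are (A, B, C) indicators, copied from the two data sets above.
data_set_1 = [
    (1, 1, 1), (1, 1, 0), (0, 1, 1), (1, 0, 1), (1, 0, 0),
    (1, 0, 0), (0, 1, 0), (0, 1, 0), (0, 0, 1), (0, 0, 1),
]
data_set_2 = [
    (1, 1, 0), (1, 1, 0), (1, 0, 1), (1, 0, 1), (0, 1, 1),
    (0, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1),
]

counts_1 = pair_counts(data_set_1)
counts_2 = pair_counts(data_set_2)

# The pairwise summaries are identical even though the data sets differ.
print(counts_1 == counts_2)

# With the total known, the number of users with all three is recoverable.
print(triple_count(len(data_set_1), counts_1))
print(triple_count(len(data_set_2), counts_2))
```

With three characteristics, the single counts, pairwise counts, and triple count together pin down how many users fall into each combination, which is why the total is enough there. With four or more characteristics there are more unknown combinations than constraints, so the same trick no longer works.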