Search code examples
python-3.xdatatableextractdata-manipulation

Can we separate data using Unique ID in to the following format?


Current Format:

UNIQUE ID NAME AGE DEP RANK
001 John 10 4th 1
002 Priya 11 4th 2
003 Jack 15 5th 2
004 Jill 14 5th 1

Expected Format:

UNIQUE ID NAME COLUMN_NO
001 John 1
001 10 2
001 4th 3
001 1 4
002 Priya 1
002 11 2
002 4th 3
002 2 4

Solution

  • My starting point:

    >>> df
        UNIQUE ID   NAME    AGE  DEP   RANK
    0           1   John     10  4th      1
    1           2  Priya     11  4th      2
    2           3   Jack     15  5th      2
    3           4   Jill     14  5th      1
    

    The basic transformation you need is provided by df.stack, which results in:

    0  UNIQUE ID         1
       NAME           John
       AGE              10
       DEP             4th
       RANK              1
    1  UNIQUE ID         2
       NAME          Priya
    [...]
    

    However, you want column UNIQUE ID to be treated separately. This can be accomplished by making it the index:

    >>> df.set_index('UNIQUE ID').stack()
    UNIQUE ID
    1          NAME     John
               AGE         10
               DEP       4th
               RANK         1
    2          NAME    Priya
               AGE         11
               DEP       4th
               RANK         2
    

    The last missing bit are the column names: you want them renamed to numbers. This could be accomplished two different ways: a) by re-assigning df.columns (after having moved column UNIQUE ID to the index first):

    df = df.set_index('UNIQUE_ID')
    df.columns = range(1, 5)
    

    or b) by df.renaming the columns:

    df = df.set_index('UNIQUE_ID')
    df = df.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
    

    And finally you can convert the resulting Series back to a DataFrame. The most elegant way to get COLUMN NO at the right place is using df.rename_axis before stacking. All together as one expression (possibly better to split it up):

    >>> (df.set_index('UNIQUE ID')
           .rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
           .rename_axis('COLUMN NO', axis=1)
           .stack()
           .to_frame('NAME')
           .reset_index())
    
        UNIQUE ID  COLUMN NO    NAME
    0           1          1   John
    1           1          2      10
    2           1          3    4th
    3           1          4       1
    4           2          1  Priya
    5           2          2      11
    6           2          3    4th
    7           2          4       2
    8           3          1   Jack
    9           3          2      15
    10          3          3    5th
    11          3          4       2
    12          4          1   Jill
    13          4          2      14
    14          4          3    5th
    15          4          4       1
    

    Things left out: reading the data; preserving the correct type: UNIQUE ID only looks numeric, but has leading zeros that probably want to be preserved; so parsing them as a string would be better.