Search code examples
pythonpandasmelt

cast data frame to long, dropping rows, expanding others


I have a data frame to pivot to longer:

import pandas as pd
import io

_1 = pd.read_csv(io.StringIO(
    """date;        origin; val_one; val_two; aaa; bbb; ccc; ddd; eee; fff
    10/11/2009;        aaa;       1;       0;   0;   0;   0;   0;   1;   0
    10/11/2009;        bbb;       0;       1;   1;   0;   0;   0;   0;   1
    10/11/2009;        ccc;       0;       1;   0;   0;   0;   0;   0;   0
    10/11/2009;        ddd;       0;       2;   0;   1;   1;   1;   0;   0"""),
    sep=";").set_axis(['date', 'origin', 'val_one', 'val_two',
            'aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'], axis=1)

I now want to create origin-target (rows aaa:fff) dyads. Rows where no target exists (i.e. none of the target columns is 1, e.g. the third row, index 2) should be dropped; rows where more than one target exists (e.g. the second row, index 1, where aaa is 1 and fff is one of the target columns is 1) should be turned into two rows. The expected output is:

_2_targ = pd.read_csv(io.StringIO(
"""date;        origin; val_one; val_two; target
10/11/2009;        aaa;       1;       0;    eee
10/11/2009;        bbb;       0;       1;    aaa
10/11/2009;        bbb;       0;       1;    fff
10/11/2009;        ddd;       0;       2;    bbb
10/11/2009;        ddd;       0;       2;    ccc
10/11/2009;        ddd;       0;       2;    ddd"""), 
sep=';').set_axis(['date', 'origin', 'val_one', 'val_two', 'target'])

I have tried using pd.melt to no avail (see below) - this creates all dyads, even the ones I don't want to keep (because no target column is 1).

_2 = pd.melt(_1,
             id_vars=['date', 'origin', 'val_one', 'val_two'],
             value_vars=['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'],
             var_name='target', value_name='tmp')

What am I missing?


Solution

  • I believe you can melt exactly as you are doing, and then drop rows from your long_df that have a value of 0 for tmp

    import pandas as pd
    import io
    
    #nice reproducible input!
    df1 = pd.read_csv(io.StringIO(
        """date;        origin; val_one; val_two; aaa; bbb; ccc; ddd; eee; fff
        10/11/2009;        aaa;       1;       0;   0;   0;   0;   0;   1;   0
        10/11/2009;        bbb;       0;       1;   1;   0;   0;   0;   0;   1
        10/11/2009;        ccc;       0;       1;   0;   0;   0;   0;   0;   0
        10/11/2009;        ddd;       0;       2;   0;   1;   1;   1;   0;   0"""),
        sep=";").set_axis(['date', 'origin', 'val_one', 'val_two',
                'aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'], axis=1)
    
    #Melt exactly the same as you did
    long_df = pd.melt(
        df1,
        id_vars=['date', 'origin', 'val_one', 'val_two'],
        value_vars=['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff'],
        var_name='target', value_name='tmp'
    )
    
    #Filter out rows where tmp is 0, sort to match your output, and drop the tmp column
    long_df = long_df[long_df['tmp'].gt(0)].sort_values('origin').drop(columns='tmp').reset_index(drop=True)
    long_df
    

    The output is nearly as you show above, but I think your forgetting the bbb:fff row

                 date       origin  val_one  val_two target
    0      10/11/2009          aaa        1        0    eee
    1      10/11/2009          bbb        0        1    aaa
    2      10/11/2009          bbb        0        1    fff
    3      10/11/2009          ddd        0        2    bbb
    4      10/11/2009          ddd        0        2    ccc
    5      10/11/2009          ddd        0        2    ddd
    

    I've also renamed your variables but that's obviously not important