Search code examples
pythonpandasscikit-learnstatisticskernel-density

Stack columns with accompanying label column


I am try to stack a group of columns in order to fit a Kernel Density estimator to in order to understand how the probability of observing an interval changes with respect to a change in time and price.

My current DataFrame is as follows (not exact values, just an example):

date        price   1d_change  2d_price_change
2017-01-03  10.2    1.0        7.8
2017-01-04  11.2    7.8        9.4
2017-01-05  17.0    3.6        1.5
2017-01-06  20.6    -2.1       ...
2017-01-07  18.5    ...        ... 

I would like to stack each price change column into one single column, and create another column that corresponds to the change in time, for example:

price_change    time_interval
10.2            1
11.2            1
17.0            1
20.6            1
18.5            1
7.8             2
9.4             2
1.5             2

I am aware that I can simply use pd.hstack() to achieve this, but I am unsure of how to create a corresponding column that labels the change in time.

Any help gratefully received.


Solution

  • Setting up source data.

    df = pd.DataFrame({
    'date': ['2017-01-03', '2017-01-04', '2017-01-05', '2017-01-06'], 
    'price': [10.2, 11.2, 17, 20.6], 
    '1d': [1, 7.8, 3.6, -2.1], 
    '2d': [7.8, 9.4, 1.5, 3.3]})
    
    df = df[['date', 'price', '1d', '2d']]
    
    print(df)
    
             date  price   1d   2d
    0  2017-01-03   10.2  1.0  7.8
    1  2017-01-04   11.2  7.8  9.4
    2  2017-01-05   17.0  3.6  1.5
    3  2017-01-06   20.6 -2.1  3.3
    

    Now for the solution. The basic idea is as you mentioned to use stack. But some prep work is needed in terms of naming the axes properly, so that when we stack and reset_index, the column names are what we want. The final step is to simply replace columns name labels '1d', '2d' etc with the appropriate integer.

    x = df.set_index('date').stack()
    x.index.set_names(['date', 'time_interval'], inplace=True)
    x.name = 'price_change'
    print(x)
    
    date        time_interval
    2017-01-03  price            10.2
                1d                1.0
                2d                7.8
    2017-01-04  price            11.2
                1d                7.8
                2d                9.4
    2017-01-05  price            17.0
                1d                3.6
                2d                1.5
    2017-01-06  price            20.6
                1d               -2.1
                2d                3.3
    
    stacked = x.reset_index().replace({'price': 1, '1d': 2, '2d': 3})
    print(stacked)
    
              date  time_interval  price_change
    0   2017-01-03              1          10.2
    1   2017-01-03              2           1.0
    2   2017-01-03              3           7.8
    3   2017-01-04              1          11.2
    4   2017-01-04              2           7.8
    5   2017-01-04              3           9.4
    6   2017-01-05              1          17.0
    7   2017-01-05              2           3.6
    8   2017-01-05              3           1.5
    9   2017-01-06              1          20.6
    10  2017-01-06              2          -2.1
    11  2017-01-06              3           3.3