Search code examples
pythonnumpypandasquantitative-finance

manipulating value of pandas dataframe cell based on value in previous row without iteration


I have a pandas dataframe with~3900 rows and 6 columns compiled from Google Finance . One of these columns defines a time in unix format, specifically defining a time during the trading day for a market. In this case the DJIA from 930A EST to 4P EST. However, only the cell for the beginning of each day (930A) has the complete unix time stamp (prefixed with an 'a') and the others are the minutes after the first time of the day.

Here is an example of the raw data:

          Date   Close    High     Low    Open  Volume
0  a1450449000  173.87  173.87  173.83  173.87   46987
1            1  173.61  173.83  173.55  173.78   19275
2            2  173.37  173.63  173.37  173.60   16014
3            3  173.50  173.59  173.31  173.34   14198
4            4  173.50  173.57  173.46  173.52    7010
          Date   Close    High     Low    Open  Volume
388          388  171.16  171.27  171.15  171.26   11809
389          389  171.11  171.23  171.07  171.18   30449
390          390  170.89  171.16  170.89  171.09  163937
391  a1450708200  172.28  172.28  172.28  172.28   23880
392            1  172.27  172.27  172.00  172.06    2719

The change at index 391 is not contiguous such that a solution like @Stefan's would unfortunately not correctly adjust the Date value.

I can easily enough go through with a lambda and line by line remove the 'a' (if necessary) convert the values to an integer and convert the minutes past 930A into seconds with the following code:

import pandas as pd
import numpy as np
import datetime

bars = pd.read_csv(r'http://www.google.com/finance/getprices?i=60&p=10d&f=d,o,h,l,c,v&df=cpct&q=DIA', skiprows=7, header=None, names=['Date', 'Close', 'High', 'Low', 'Open', 'Volume'])

bars['Date'] = bars['Date'].map(lambda x: int(x[1:]) if x[0] == 'a' else int(x))    
bars['Date'] = bars['Date'].map(lambda u: u * 60 if u < 400 else u)

Now what I would like to do is, without iterating over the dataframe, determine if the value of bars['Date'] is not a unix time stamp (e.g. < 24000 in the terms of this data set). If so I want to add that value to the time stamp for that particular day to create a complete unix time stamp for each entry.

I know that I can compare the previous row via:

bars['Date'][:-1]>bars['Date'][1:]

I feel like that would be the way to go but I cant figure out a way to use this in a function as it returns a series.

Thanks in advance for any help!


Solution

  • You could add a new column that always contains the latest Timestamp and then add to the Date where necessary.

    threshold = 24000
    bars['Timestamp'] = bars[bars['Date']>threshold].loc[:, 'Date']
    bars['Timestamp'] = bars['Timestamp'].fillna(method='ffill')
    bars['Date'] = bars.apply(lambda x: x.Date + x.Timestamp if x.Date < threshold else x.Date, axis=1)
    bars.drop('Timestamp', axis=1, inplace=True)
    

    to get:

                Date   Close     High     Low    Open  Volume
    0     1450449000  173.87  173.870  173.83  173.87   46987
    1     1450449060  173.61  173.830  173.55  173.78   19275
    2     1450449120  173.37  173.630  173.37  173.60   16014
    3     1450449180  173.50  173.590  173.31  173.34   14198
    4     1450449240  173.50  173.570  173.46  173.52    7010
    5     1450449300  173.66  173.680  173.44  173.45   10597
    6     1450449360  173.40  173.670  173.34  173.67   14270
    7     1450449420  173.36  173.360  173.13  173.32   22485
    8     1450449480  173.29  173.480  173.25  173.36   18542