I have a pandas dataframe with~3900 rows and 6 columns compiled from Google Finance . One of these columns defines a time in unix format, specifically defining a time during the trading day for a market. In this case the DJIA from 930A EST to 4P EST. However, only the cell for the beginning of each day (930A) has the complete unix time stamp (prefixed with an 'a') and the others are the minutes after the first time of the day.
Here is an example of the raw data:
Date Close High Low Open Volume
0 a1450449000 173.87 173.87 173.83 173.87 46987
1 1 173.61 173.83 173.55 173.78 19275
2 2 173.37 173.63 173.37 173.60 16014
3 3 173.50 173.59 173.31 173.34 14198
4 4 173.50 173.57 173.46 173.52 7010
Date Close High Low Open Volume
388 388 171.16 171.27 171.15 171.26 11809
389 389 171.11 171.23 171.07 171.18 30449
390 390 170.89 171.16 170.89 171.09 163937
391 a1450708200 172.28 172.28 172.28 172.28 23880
392 1 172.27 172.27 172.00 172.06 2719
The change at index 391 is not contiguous such that a solution like @Stefan's would unfortunately not correctly adjust the Date
value.
I can easily enough go through with a lambda and line by line remove the 'a' (if necessary) convert the values to an integer and convert the minutes past 930A into seconds with the following code:
import pandas as pd
import numpy as np
import datetime
bars = pd.read_csv(r'http://www.google.com/finance/getprices?i=60&p=10d&f=d,o,h,l,c,v&df=cpct&q=DIA', skiprows=7, header=None, names=['Date', 'Close', 'High', 'Low', 'Open', 'Volume'])
bars['Date'] = bars['Date'].map(lambda x: int(x[1:]) if x[0] == 'a' else int(x))
bars['Date'] = bars['Date'].map(lambda u: u * 60 if u < 400 else u)
Now what I would like to do is, without iterating over the dataframe, determine if the value of bars['Date'] is not a unix time stamp (e.g. < 24000 in the terms of this data set). If so I want to add that value to the time stamp for that particular day to create a complete unix time stamp for each entry.
I know that I can compare the previous row via:
bars['Date'][:-1]>bars['Date'][1:]
I feel like that would be the way to go but I cant figure out a way to use this in a function as it returns a series.
Thanks in advance for any help!
You could add a new column that always contains the latest Timestamp
and then add to the Date
where necessary.
threshold = 24000
bars['Timestamp'] = bars[bars['Date']>threshold].loc[:, 'Date']
bars['Timestamp'] = bars['Timestamp'].fillna(method='ffill')
bars['Date'] = bars.apply(lambda x: x.Date + x.Timestamp if x.Date < threshold else x.Date, axis=1)
bars.drop('Timestamp', axis=1, inplace=True)
to get:
Date Close High Low Open Volume
0 1450449000 173.87 173.870 173.83 173.87 46987
1 1450449060 173.61 173.830 173.55 173.78 19275
2 1450449120 173.37 173.630 173.37 173.60 16014
3 1450449180 173.50 173.590 173.31 173.34 14198
4 1450449240 173.50 173.570 173.46 173.52 7010
5 1450449300 173.66 173.680 173.44 173.45 10597
6 1450449360 173.40 173.670 173.34 173.67 14270
7 1450449420 173.36 173.360 173.13 173.32 22485
8 1450449480 173.29 173.480 173.25 173.36 18542