python datetime pandas missing-data reindex

filling the missing points in the time series data with pandas.date_range and pandas.reindex python

I am trying to fill the missing points in the time series data in the ascii file with pandas. I think that other things are okay, but the first line was filled with nan even though there was data originally. My data sample is:

"2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
                                    .
                                    .


"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1

And I used the code as below:

t1 = np.genfromtxt(INPUT,dtype=None,delimiter=',',usecols=[0])
start = t1[0].strip('\'"')
end = t1[-1].strip('\'"')
data=pd.read_csv(INPUT,sep=',',index_col=[0],parse_dates=[0])
index = pd.date_range(start,end,freq="30S")
df = data
sk_f = df.reindex(index)

So with this code, I wanted to read the first and the last string of the first column and made them to the index to fill the possible missing points indicated as nan. However, the problem is that first column was also filled in the results as below:

2011-08-26 00:00:00,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan

2011-08-26 00:00:30,1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
                                    .
                                    .


2011-08-26 23:59:30,1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1

It means that the first line was unexpectedly filled eventhough there are data in the original file. From the second line, every thing is okay and filling missing data also seems okay. I am trying to find why it happens. To be honest, I couldn't find the reason yet. Any idea or help would be really appreciated. Thank you, Isaac

Solution

I think you can omit read file by genfromtxt and try only read_csv, then found min and max dates for reindex method.

Or use resample:

import pandas as pd
import numpy as np
import io

temp=u""""2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1"""

#after testing replace io.StringIO(temp) to filename
df = pd.read_csv(io.StringIO(temp), sep=",", index_col=[0], parse_dates=[0], header=None)
print df
                          1       2      3      4      5      6         7   \
0                                                                            
2011-08-26 00:00:00  1155179   3.232  23.70  3.281  0.386  25.27  111.5665   
2011-08-26 00:00:30  1155180   3.289  20.44  2.153  0.222  25.25  111.5735   
2011-08-26 23:59:30  1155297  12.620  28.06  3.162  1.356  24.30  111.4614   

                        8      9      10  11     12     13     14     15  
0                                                                         
2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
2011-08-26 23:59:30  28.65  29.84  19.53   0  111.4  13.06  29.50  350.1

start = df.index.min()
end = df.index.max()
print start
2011-08-26 00:00:00
print end
2011-08-26 23:59:30

index = pd.date_range(start,end,freq="30S")
sk_f = df.reindex(index)
print sk_f.head()
                          1      2      3      4      5      6         7   \
2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665   
2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
2011-08-26 00:01:00      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
2011-08-26 00:01:30      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
2011-08-26 00:02:00      NaN    NaN    NaN    NaN    NaN    NaN       NaN   

                        8      9      10  11     12     13     14     15  
2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
2011-08-26 00:01:00    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
2011-08-26 00:01:30    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
2011-08-26 00:02:00    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN

print df.resample('30S', fill_method='ffill').head()
                          1      2      3      4      5      6         7   \
0                                                                           
2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665   
2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
2011-08-26 00:01:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
2011-08-26 00:01:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
2011-08-26 00:02:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   

                        8      9      10  11     12     13     14     15  
0                                                                         
2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
2011-08-26 00:01:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
2011-08-26 00:01:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
2011-08-26 00:02:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4