Search code examples
pythondatetimepandasmissing-datareindex

filling the missing points in the time series data with pandas.date_range and pandas.reindex python


I am trying to fill the missing points in the time series data in the ascii file with pandas. I think that other things are okay, but the first line was filled with nan even though there was data originally. My data sample is:

"2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
"2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
                                    .
                                    .


"2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1

And I used the code as below:

t1 = np.genfromtxt(INPUT,dtype=None,delimiter=',',usecols=[0])
start = t1[0].strip('\'"')
end = t1[-1].strip('\'"')
data=pd.read_csv(INPUT,sep=',',index_col=[0],parse_dates=[0])
index = pd.date_range(start,end,freq="30S")
df = data
sk_f = df.reindex(index)

So with this code, I wanted to read the first and the last string of the first column and made them to the index to fill the possible missing points indicated as nan. However, the problem is that first column was also filled in the results as below:

2011-08-26 00:00:00,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan

2011-08-26 00:00:30,1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
                                    .
                                    .


2011-08-26 23:59:30,1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1

It means that the first line was unexpectedly filled eventhough there are data in the original file. From the second line, every thing is okay and filling missing data also seems okay. I am trying to find why it happens. To be honest, I couldn't find the reason yet. Any idea or help would be really appreciated. Thank you, Isaac


Solution

  • I think you can omit read file by genfromtxt and try only read_csv, then found min and max dates for reindex method.

    Or use resample:

    import pandas as pd
    import numpy as np
    import io
    
    temp=u""""2011-08-26 00:00:00",1155179,3.232,23.7,3.281,0.386,25.27,111.5665,28.92,29.83,19.13,0,111.5,13.02,29.77,345.7
    "2011-08-26 00:00:30",1155180,3.289,20.44,2.153,0.222,25.25,111.5735,28.94,29.82,19.53,0,111.5,13.02,29.79,342.4
    "2011-08-26 23:59:30",1155297,12.62,28.06,3.162,1.356,24.3,111.4614,28.65,29.84,19.53,0,111.4,13.06,29.50,350.1"""
    
    #after testing replace io.StringIO(temp) to filename
    df = pd.read_csv(io.StringIO(temp), sep=",", index_col=[0], parse_dates=[0], header=None)
    print df
                              1       2      3      4      5      6         7   \
    0                                                                            
    2011-08-26 00:00:00  1155179   3.232  23.70  3.281  0.386  25.27  111.5665   
    2011-08-26 00:00:30  1155180   3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 23:59:30  1155297  12.620  28.06  3.162  1.356  24.30  111.4614   
    
                            8      9      10  11     12     13     14     15  
    0                                                                         
    2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
    2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 23:59:30  28.65  29.84  19.53   0  111.4  13.06  29.50  350.1  
    
    start = df.index.min()
    end = df.index.max()
    print start
    2011-08-26 00:00:00
    print end
    2011-08-26 23:59:30
    
    index = pd.date_range(start,end,freq="30S")
    sk_f = df.reindex(index)
    print sk_f.head()
                              1      2      3      4      5      6         7   \
    2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665   
    2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:01:00      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
    2011-08-26 00:01:30      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
    2011-08-26 00:02:00      NaN    NaN    NaN    NaN    NaN    NaN       NaN   
    
                            8      9      10  11     12     13     14     15  
    2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
    2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:01:00    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
    2011-08-26 00:01:30    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
    2011-08-26 00:02:00    NaN    NaN    NaN NaN    NaN    NaN    NaN    NaN  
    
    print df.resample('30S', fill_method='ffill').head()
                              1      2      3      4      5      6         7   \
    0                                                                           
    2011-08-26 00:00:00  1155179  3.232  23.70  3.281  0.386  25.27  111.5665   
    2011-08-26 00:00:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:01:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:01:30  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    2011-08-26 00:02:00  1155180  3.289  20.44  2.153  0.222  25.25  111.5735   
    
                            8      9      10  11     12     13     14     15  
    0                                                                         
    2011-08-26 00:00:00  28.92  29.83  19.13   0  111.5  13.02  29.77  345.7  
    2011-08-26 00:00:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:01:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:01:30  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4  
    2011-08-26 00:02:00  28.94  29.82  19.53   0  111.5  13.02  29.79  342.4