I am trying to fill the missing lines in my time series data using pandas reindex function.
My data looks like:
100,2007,239,4,29.588,-30.851,-999.0,-999.0,-999.0,-999.00,13.125,-999.00
100,2007,239,5,29.573,-30.843,-999.0,-999.0,-999.0,-999.00,13.126,-999.00
100,2007,239,14,29.389,-30.880,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,15,29.367,-30.901,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,24,29.374,-30.920,-999.0,-999.0,-999.0,-999.00,13.135,-999.00
It is timeseries data for one day with one minute time interval which the fourth column indicates. Unlikely to normal time series index, time index of this data look like 0 to 59, 100 to 159 ....2300 to 2359 because 1 day is 24 hours and 1 hour is 60 minutes. So, fill the gap with 'nan' value, I made the code as bellow:
S = []
for i in range(0,24):
s = np.arange(i*100,i*100+60)
s = list(s)
S = S + s
pd.set_option('max_rows',10)
for INPUT in FileList:
output = INPUT + "result" # set the output files
data=pd.read_csv(INPUT,sep=',',index_col=[3],parse_dates=[3])
index = 'S'#make the reference index to fill
df = data
sk_f = df.reindex(index)
sk_f.to_csv(output,na_rep='nan')
By this code, I purposed to fill the gap by the line of 'nan' following the indice in the fourth column based on S which is the reference index. But the result is just the rows of 'nan' rather than filling the gap as below:
,100,2007,241,22.471,-31.002,-999.0,-999.0.1,-999.0.2,-999.00,13.294,-999.00 .1
0,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
1,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
2,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
3,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
4,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
5,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
6,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
7,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
8,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
9,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
10,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
11,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan
My expectation is to fill the gap of missing lines in the original data. For example, in the original data, there is no lows between 0 to 3 index line. So I would like to fill those lines with original data format. I may miss something.
Firstly, I find problematic indent with create list S = S + s
. You have to use, because list S
kept only last s
:
S = []
for i in range(0,24):
s = np.arange(i*100,i*100+60)
s = list(s)
S = S + s #keep only last s
to:
S = []
for i in range(0,24):
s = np.arange(i*100,i*100+60)
s = list(s)
S = S + s
or shorter:
S = []
for i in range(0,24):
S = S + list(np.arange(i*100,i*100+60))
Next is problematic index = 'S'
I think, it is typo and it can be index = S
.
You can add function bfill()
and fill gaps backward. link
sk_f = df.reindex(index).bfill()
Code:
import pandas as pd
import numpy as np
import io
S = []
for i in range(0,24):
S = S + list(np.arange(i*100,i*100+60))
#original data
temp=u"""100,2007,239,4,29.588,-30.851,-999.0,-999.0,-999.0,-999.00,13.125,-999.00
100,2007,239,5,29.573,-30.843,-999.0,-999.0,-999.0,-999.00,13.126,-999.00
100,2007,239,14,29.389,-30.880,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,15,29.367,-30.901,-999.0,-999.0,-999.0,-999.00,13.131,-999.00
100,2007,239,24,29.374,-30.920,-999.0,-999.0,-999.0,-999.00,13.135,-999.00"""
#pd.set_option('max_rows',10)
data=pd.read_csv(io.StringIO(temp),sep=',', header=None, index_col=[3], parse_dates=[3])
data.index.name = None
print data
# 0 1 2 4 5 6 7 8 9 10 11
#4 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#5 100 2007 239 29.573 -30.843 -999 -999 -999 -999 13.126 -999
#14 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#15 100 2007 239 29.367 -30.901 -999 -999 -999 -999 13.131 -999
#24 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
index = S #make the reference index to fill
df = data
sk_f = df.reindex(index).bfill()
print sk_f.head(20)
# 0 1 2 4 5 6 7 8 9 10 11
#0 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#1 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#2 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#3 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#4 100 2007 239 29.588 -30.851 -999 -999 -999 -999 13.125 -999
#5 100 2007 239 29.573 -30.843 -999 -999 -999 -999 13.126 -999
#6 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#7 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#8 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#9 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#10 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#11 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#12 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#13 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#14 100 2007 239 29.389 -30.880 -999 -999 -999 -999 13.131 -999
#15 100 2007 239 29.367 -30.901 -999 -999 -999 -999 13.131 -999
#16 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
#17 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
#18 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999
#19 100 2007 239 29.374 -30.920 -999 -999 -999 -999 13.135 -999