I am trying to merge two MultiIndexed DataFrames. My code is below. The problem, as you can see in the output, is that the "DATE" index is repeated, whereas I'd like all the values (OPEN_INT, PX_LAST) to sit on the same date index. Any ideas? I've tried both append and concat, but both give me similar results.
if df.empty:
    df = bbg_historicaldata(t, f, startDate, endDate)
    print(df)
    datesArray = list(df.index)
    tArray = [t for i in range(len(datesArray))]
    arrays = [tArray, datesArray]
    tuples = list(zip(*arrays))
    index = pd.MultiIndex.from_tuples(tuples, names=['TICKER', 'DATE'])
    df = pd.DataFrame({f : df[f].values}, index=index)
else:
    temp = bbg_historicaldata(t,f,startDate,endDate)
    print(temp)
    datesArray = list(temp.index)
    tArray = [t for i in range(len(datesArray))]
    arrays = [tArray, datesArray]
    tuples = list(zip(*arrays))
    index = pd.MultiIndex.from_tuples(tuples, names=['TICKER', 'DATE'])
    temp = pd.DataFrame({f : temp[f].values}, index=index)
    #df = df.append(temp, ignore_index = True)
    df = pd.concat([df, temp], axis = 1).sortlevel()
Essentially, I want no NaNs!
PX_LAST OPEN_INT PX_LAST OPEN_INT PX_LAST \
TICKER DATE
EDH8 COMDTY 2017-02-01 98.365 1008044.0 NaN NaN NaN
2017-02-02 98.370 1009994.0 NaN NaN NaN
2017-02-03 98.360 1019181.0 NaN NaN NaN
2017-02-06 98.405 1023863.0 NaN NaN NaN
2017-02-07 98.410 1024609.0 NaN NaN NaN
2017-02-08 98.435 1046258.0 NaN NaN NaN
2017-02-09 98.395 1050291.0 NaN NaN NaN
EDM8 COMDTY 2017-02-01 NaN NaN 98.245 726739.0 NaN
2017-02-02 NaN NaN 98.250 715081.0 NaN
2017-02-03 NaN NaN 98.235 723936.0 NaN
2017-02-06 NaN NaN 98.285 729324.0 NaN
2017-02-07 NaN NaN 98.295 728673.0 NaN
2017-02-08 NaN NaN 98.325 728520.0 NaN
2017-02-09 NaN NaN 98.280 741840.0 NaN
EDU8 COMDTY 2017-02-01 NaN NaN NaN NaN 98.130
2017-02-02 NaN NaN NaN NaN 98.135
2017-02-03 NaN NaN NaN NaN 98.120
2017-02-06 NaN NaN NaN NaN 98.180
2017-02-07 NaN NaN NaN NaN 98.190
2017-02-08 NaN NaN NaN NaN 98.225
2017-02-09 NaN NaN NaN NaN 98.175
EDIT: Using axis=0 gives the following. I'd like it to collapse the duplicated dates (i.e., each date index should be unique, with no duplicated days or NaNs):
OPEN_INT PX_LAST
TICKER DATE
EDH8 COMDTY 2017-02-01 NaN 98.365
2017-02-01 1008044.0 NaN
2017-02-02 NaN 98.370
2017-02-02 1009994.0 NaN
2017-02-03 NaN 98.360
2017-02-03 1019181.0 NaN
2017-02-06 NaN 98.405
2017-02-06 1023863.0 NaN
2017-02-07 NaN 98.410
2017-02-07 1024609.0 NaN
2017-02-08 NaN 98.435
2017-02-08 1046258.0 NaN
2017-02-09 NaN 98.395
2017-02-09 1050291.0 NaN
EDM8 COMDTY 2017-02-01 NaN 98.245
2017-02-01 726739.0 NaN
2017-02-02 NaN 98.250
2017-02-02 715081.0 NaN
2017-02-03 NaN 98.235
2017-02-03 723936.0 NaN
2017-02-06 NaN 98.285
2017-02-06 729324.0 NaN
2017-02-07 NaN 98.295
2017-02-07 728673.0 NaN
2017-02-08 NaN 98.325
2017-02-08 728520.0 NaN
2017-02-09 NaN 98.280
2017-02-09 741840.0 NaN
Here is the input data, printed. I've added print(df) and print(temp) to the code above. They're all DataFrames with DATE as the index. The TICKER index level comes from the variable "t", and the column name comes from the variable "f" in the loop "for f in fields:".
PX_LAST
DATE
2017-02-01 98.365
2017-02-02 98.370
2017-02-03 98.360
2017-02-06 98.405
2017-02-07 98.410
2017-02-08 98.435
2017-02-09 98.395
OPEN_INT
DATE
2017-02-01 1008044.0
2017-02-02 1009994.0
2017-02-03 1019181.0
2017-02-06 1023863.0
2017-02-07 1024609.0
2017-02-08 1046258.0
2017-02-09 1050291.0
PX_LAST
DATE
2017-02-01 98.245
2017-02-02 98.250
2017-02-03 98.235
2017-02-06 98.285
2017-02-07 98.295
2017-02-08 98.325
2017-02-09 98.280
OPEN_INT
DATE
2017-02-01 726739.0
2017-02-02 715081.0
2017-02-03 723936.0
2017-02-06 729324.0
2017-02-07 728673.0
2017-02-08 728520.0
2017-02-09 741840.0
PX_LAST
DATE
2017-02-01 98.130
2017-02-02 98.135
2017-02-03 98.120
2017-02-06 98.180
2017-02-07 98.190
2017-02-08 98.225
2017-02-09 98.175
OPEN_INT
DATE
2017-02-01 584448.0
2017-02-02 574246.0
2017-02-03 581897.0
2017-02-06 585169.0
2017-02-07 590248.0
2017-02-08 598478.0
2017-02-09 595884.0
Your logic is a little hard to follow (for example, it's hard to see why you sometimes get different columns from your data call). AFAICT, though, you really just want to do a join among all the frames with the same ticker (if you set the index to TICKER, DATE), or a merge if TICKER and DATE are columns, and then concatenate the results of those. Trying to do both in one step is what's causing the problem.
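For reference, the two-step version (join per ticker, then concatenate) might look roughly like the sketch below. I don't have your bbg_historicaldata, so fetch and the raw dict are hypothetical stand-ins that return one single-column frame per (ticker, field), using a few values from your printed input:

```python
import pandas as pd

dates = pd.DatetimeIndex(["2017-02-01", "2017-02-02", "2017-02-03"], name="DATE")

# Hypothetical stand-in data, copied from the frames printed in the question.
raw = {
    ("EDH8 COMDTY", "PX_LAST"): [98.365, 98.370, 98.360],
    ("EDH8 COMDTY", "OPEN_INT"): [1008044.0, 1009994.0, 1019181.0],
    ("EDM8 COMDTY", "PX_LAST"): [98.245, 98.250, 98.235],
    ("EDM8 COMDTY", "OPEN_INT"): [726739.0, 715081.0, 723936.0],
}


def fetch(ticker, field):
    """Hypothetical replacement for bbg_historicaldata: one column, DATE index."""
    return pd.DataFrame({field: raw[(ticker, field)]}, index=dates)


tickers = ["EDH8 COMDTY", "EDM8 COMDTY"]
fields = ["PX_LAST", "OPEN_INT"]

per_ticker = {}
for t in tickers:
    # Step 1: join the fields of one ticker side by side on the DATE index.
    frame = fetch(t, fields[0])
    for f in fields[1:]:
        frame = frame.join(fetch(t, f), how="outer")
    per_ticker[t] = frame

# Step 2: stack the per-ticker frames; the dict keys become the TICKER level.
result = pd.concat(per_ticker, names=["TICKER", "DATE"]).sort_index()
print(result)
```

Each (TICKER, DATE) pair then appears exactly once, with both fields on the same row.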
Alternatively, we can just concat the whole thing and then pivot, which is what I'll do here because it's easier to show.
(As an aside, repeatedly concatenating within a loop can be a performance problem, because a lot of data needs to be copied each time, and should generally be avoided. Build a collection of what you want to concatenate first, and then concatenate it once.)
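To illustrate that aside, here is a small sketch comparing the two patterns. The chunks list is made up purely for demonstration; in your case each element would be one frame returned by your data call:

```python
import pandas as pd

# Hypothetical pieces standing in for the frames returned by each request.
chunks = [
    pd.DataFrame({"value": [i, i + 0.5]}, index=[2 * i, 2 * i + 1])
    for i in range(4)
]

# Pattern to avoid: each iteration copies everything accumulated so far.
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk])

# Preferred pattern: gather the pieces in a list, then concatenate once.
collected = []
for chunk in chunks:
    collected.append(chunk)
combined = pd.concat(collected)

# Both give the same frame; only the amount of copying differs.
print(grown.equals(combined))
```

The result is identical either way, so switching to the list version shouldn't change anything except the amount of intermediate copying.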
Assuming that each of your frames starts looking like the following (where the column might be different):
In [532]: df
Out[532]:
PX_LAST
DATE
2017-02-01 98.365
2017-02-02 98.370
2017-02-03 98.360
2017-02-06 98.405
2017-02-07 98.410
2017-02-08 98.435
2017-02-09 98.395
then instead of what you're doing now I'd just add the ticker to the frame and reset the index:
In [549]: df = df.assign(TICKER=t).reset_index() #TICKER variable = t
Out[549]:
DATE PX_LAST TICKER
0 2017-02-01 98.365 EDH8 COMDTY
1 2017-02-02 98.370 EDH8 COMDTY
2 2017-02-03 98.360 EDH8 COMDTY
3 2017-02-06 98.405 EDH8 COMDTY
4 2017-02-07 98.410 EDH8 COMDTY
5 2017-02-08 98.435 EDH8 COMDTY
6 2017-02-09 98.395 EDH8 COMDTY
To make the concatenation more memory-friendly, let's melt this:
In [579]: pd.melt(df, id_vars=["TICKER", "DATE"])
Out[579]:
TICKER DATE variable value
0 EDH8 COMDTY 2017-02-01 PX_LAST 98.365
1 EDH8 COMDTY 2017-02-02 PX_LAST 98.370
2 EDH8 COMDTY 2017-02-03 PX_LAST 98.360
3 EDH8 COMDTY 2017-02-06 PX_LAST 98.405
4 EDH8 COMDTY 2017-02-07 PX_LAST 98.410
5 EDH8 COMDTY 2017-02-08 PX_LAST 98.435
6 EDH8 COMDTY 2017-02-09 PX_LAST 98.395
and append this to a list dfs. Now the partial frames will combine nicely, because they all have the same columns, and we can pivot to get our desired output:
In [589]: pd.concat(dfs).pivot_table(index=["TICKER", "DATE"], columns="variable", values="value")
Out[589]:
variable OPEN_INT PX_LAST
TICKER DATE
EDH8 COMDTY 2017-02-01 1008044.0 98.365
2017-02-02 1009994.0 98.370
2017-02-03 1019181.0 98.360
2017-02-06 1023863.0 98.405
[...]
This avoids having all those intermediate NaNs. The concatenation+pivot approach works even if you don't melt, so at first I skipped the melting. On second thought, keeping those intermediate NaNs is a bad idea even though it works, because the intermediate memory requirements could grow to be prohibitive.
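Putting the steps together, a self-contained sketch of the whole loop could look like this. Again, fetch and raw are hypothetical stand-ins for bbg_historicaldata, since I can't run your data call; the values are a few rows from your printed input:

```python
import pandas as pd

dates = pd.DatetimeIndex(["2017-02-01", "2017-02-02", "2017-02-03"], name="DATE")

# Hypothetical stand-in data, copied from the frames printed in the question.
raw = {
    ("EDH8 COMDTY", "PX_LAST"): [98.365, 98.370, 98.360],
    ("EDH8 COMDTY", "OPEN_INT"): [1008044.0, 1009994.0, 1019181.0],
    ("EDM8 COMDTY", "PX_LAST"): [98.245, 98.250, 98.235],
    ("EDM8 COMDTY", "OPEN_INT"): [726739.0, 715081.0, 723936.0],
}


def fetch(ticker, field):
    """Hypothetical replacement for bbg_historicaldata: one column, DATE index."""
    return pd.DataFrame({field: raw[(ticker, field)]}, index=dates)


tickers = ["EDH8 COMDTY", "EDM8 COMDTY"]
fields = ["PX_LAST", "OPEN_INT"]

dfs = []
for t in tickers:
    for f in fields:
        # Add the ticker as a column and turn DATE back into a column.
        wide = fetch(t, f).assign(TICKER=t).reset_index()
        # Long format: one row per (TICKER, DATE, variable), so no NaN padding.
        dfs.append(pd.melt(wide, id_vars=["TICKER", "DATE"]))

# One concatenation at the end, then pivot the field names back into columns.
long_frame = pd.concat(dfs, ignore_index=True)
result = long_frame.pivot_table(
    index=["TICKER", "DATE"], columns="variable", values="value"
)
print(result)
```

One thing to keep in mind: pivot_table aggregates with the mean by default, so if the same (TICKER, DATE, variable) combination were ever fetched twice, the duplicates would be averaged silently rather than raising an error.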