Hi I have created a dictionary of dataFrame with this code
import os
import pandas
import glob
path="G:\my_dir\*"
dataList={}
for files in glob.glob(path):
dataList[files]=(read_csv(files,sep=";",index_col='Date'))
The different dataframe present in the dictory have different time sample. An example of dataFrame(A) is
Date Volume Value
2014-01-04 06:00:02 6062 108000.0
2014-01-04 06:06:05 6062 107200.0
2014-01-04 06:12:07 6062 97400.0
2014-01-04 06:18:10 6062 99200.0
2014-01-04 06:24:12 6062 91300.0
2014-01-04 06:30:14 6062 84100.0
2014-01-04 06:36:17 6062 57000.0
Example of dataFrame(B) is
Date Volume Value
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
The different data frame don't have the same number of rows. I want to merge the different data frame in a single one like this:
Data Volume B A Value(DataframeN)
2014/04/01 05:52:50 6062 4.70 NaN
2014/04/01 05:58:53 6062 4.70 NaN
2014/04/01 06:04:56 6062 4.90 107465.51
2014/04/01 06:10:58 6062 5.10 100652.60
2014/04/01 06:17:01 6062 5.20 98899.57
2014/04/01 06:23:03 6062 5.20 92618.56
2014/04/01 06:29:05 6062 5.50 85301.73
2014/04/01 06:35:08 6062 5.50 61523.06
I have done this easily with Matlab using with the command
ts_A=timeseries(ValueA,datenum(DateA));
ts_B=timeseries(ValueB,datenum(DateB));
res_A=resample(ts_A,datenum(DateB));
I have to do this for several sets of csv files so I wanted to automate the process with python.
Tnx
You can concat
the two DataFrames
, interpolate
, then reindex
on the DataFrame
you want.
I assume we have a certain number of DataFrames
, where the Date
is a DateTimeIndex
in all of them. I will use two in this example, since you used two in the question, but the code will work for any number.
df_a
:
Volume Value
Date
2014-01-04 06:00:02 6062 108000
2014-01-04 06:06:05 6062 107200
2014-01-04 06:12:07 6062 97400
2014-01-04 06:18:10 6062 99200
2014-01-04 06:24:12 6062 91300
2014-01-04 06:30:14 6062 84100
2014-01-04 06:36:17 6062 57000
df_b
:
Volume Value
Date
2014-01-04 05:52:50 6062 4.7
2014-01-04 05:58:53 6062 4.7
2014-01-04 06:04:56 6062 4.9
2014-01-04 06:10:58 6062 5.1
2014-01-04 06:17:01 6062 5.2
2014-01-04 06:23:03 6062 5.2
2014-01-04 06:29:05 6062 5.5
2014-01-04 06:35:08 6062 5.5
And I will put these into a dict
for the example. You read them directly into a dict
, so you don't need to do this step. I just want to show how my example dict
is formatted. The dict
keys
don't matter, any valid dict
key
will work:
dataList = {'a': df_a,
'b': df_b}
This gets us to where you currently are, with my dataList
hopefully having the same format as yours.
The first thing you need to do is to combine the DataFrames
. I use the dict
keys
as MultiIndex
column names so you can keep track of which instance of a given column came from which DataFrame
. You can do that like so:
df = pd.concat(dataList.values(), axis=1, keys=dataList.keys())
This gives you a DataFrame
like this:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.7
2014-01-04 05:58:53 NaN NaN 6062 4.7
2014-01-04 06:00:02 6062 108000 NaN NaN
2014-01-04 06:04:56 NaN NaN 6062 4.9
2014-01-04 06:06:05 6062 107200 NaN NaN
2014-01-04 06:10:58 NaN NaN 6062 5.1
2014-01-04 06:12:07 6062 97400 NaN NaN
2014-01-04 06:17:01 NaN NaN 6062 5.2
2014-01-04 06:18:10 6062 99200 NaN NaN
2014-01-04 06:23:03 NaN NaN 6062 5.2
2014-01-04 06:24:12 6062 91300 NaN NaN
2014-01-04 06:29:05 NaN NaN 6062 5.5
2014-01-04 06:30:14 6062 84100 NaN NaN
2014-01-04 06:35:08 NaN NaN 6062 5.5
2014-01-04 06:36:17 6062 57000 NaN NaN
Next, you need to interpolate to fill in the missing values. I interpolate using 'time'
mode
so it properly handles the time indexes:
df = df.interpolate('time')
This gives you a DataFrame
like this:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.700000
2014-01-04 05:58:53 NaN NaN 6062 4.700000
2014-01-04 06:00:02 6062 108000.000000 6062 4.738017
2014-01-04 06:04:56 6062 107352.066116 6062 4.900000
2014-01-04 06:06:05 6062 107200.000000 6062 4.938122
2014-01-04 06:10:58 6062 99267.955801 6062 5.100000
2014-01-04 06:12:07 6062 97400.000000 6062 5.119008
2014-01-04 06:17:01 6062 98857.851240 6062 5.200000
2014-01-04 06:18:10 6062 99200.000000 6062 5.200000
2014-01-04 06:23:03 6062 92805.801105 6062 5.200000
2014-01-04 06:24:12 6062 91300.000000 6062 5.257182
2014-01-04 06:29:05 6062 85472.375691 6062 5.500000
2014-01-04 06:30:14 6062 84100.000000 6062 5.500000
2014-01-04 06:35:08 6062 62151.239669 6062 5.500000
2014-01-04 06:36:17 6062 57000.000000 6062 5.500000
I think generally it would be best to stop here, since you keep all data from all csv
files. But you said you want only the time points from the longest csv
. To get that, you need to find the longest DataFrame
, and then get the rows corresponding to its indexes. Finding the longest DataFrame
is easy, you just find the one with the maximum length. Keeping only the time points in that index
is also easy, you just slice using that index
(you use the loc
method for slicing in this way).
longind = max(dataList.values(), key=len).index
df = df.loc[longind]
This gives you the following final DataFrame
:
a b
Volume Value Volume Value
Date
2014-01-04 05:52:50 NaN NaN 6062 4.7
2014-01-04 05:58:53 NaN NaN 6062 4.7
2014-01-04 06:04:56 6062 107352.066116 6062 4.9
2014-01-04 06:10:58 6062 99267.955801 6062 5.1
2014-01-04 06:17:01 6062 98857.851240 6062 5.2
2014-01-04 06:23:03 6062 92805.801105 6062 5.2
2014-01-04 06:29:05 6062 85472.375691 6062 5.5
2014-01-04 06:35:08 6062 62151.239669 6062 5.5
This can be combined into one line if you want:
df = pd.concat(dataList.values(), axis=1, keys=dataList.keys()).interpolate('time').loc[max(dataList.values(), key=len).index]
Or, perhaps a slightly clearer 4 lines:
names = dataList.keys()
dfs = dataList.values()
longind = max(dfs, key=len).index
df = pd.concat(dfs, axis=1, keys=names).interpolate('time').loc[longind]
I am not sure why my final results are different than what you show. I ran your example in MATLAB
(R2015A) myself and got the same results as I get here, so I suspect you generated the final data with a different data set than the example.