I am seeking to make a kind of binning of lines of data according to the first element of the line.
My data has this shape:
[[Temperature, value0, value1, ... value249]
[Temperature, ...
]
So to say: The first element of each line is a temperature value, the rest of the line is a time trace of a signal.
I would like to make an array of this shape:
[Temperature-bin,[[values]
[values]
... ]]
Next Temp.-bin, [[values]
[values]
... ]]
...
]
where the lines from the original data-array should be sorted in the subarray of the respective temperature bin.
data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]
start=23000
end=380000
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
binneddata=np.empty([len(bincenters),2])
for i in np.arange(len(temp)):
binneddata[i]=[bincenters[i],np.array([])]
I was hoping to get a result array as described above, where every line consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Python gives me an error regarding "setting an array element with a sequence. I could create this kind of array, consisting of different data types, in another script before, but there I had to define it specifically, which is not possible in this case because I'm handling files on the scale of several 100K lines of data. At the same point I would like to use as many built-in functions and the least possible loops, because my computer is already taking some time to process files of that size.
Thank you for your input,
lepakk
First: Thanks to kwinkunks for the hint of using a pandas dataframe. I found a solution using this feature.
The binning is now done like this:
tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])
binsize=1
bincenters=np.array(np.arange(np.round(tempmin),np.round(tempmax)+1,binsize))
lowerbinedges=np.array(bincenters-binsize/2)
higherbinedges=np.array(bincenters+binsize/2)
allbinedges=np.append(lowerbinedges,higherbinedges[-1])
temp_pd=pd.Series(temp[start:end])
traces=pd.Series(list(data[start:end,0:250]))
tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)
df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)
by defining bins (in this case even-sized). The variable "tempbins" is of the same shape as temp (the "raw" temperature) and assignes every line of data to a certain bin.
The actual analysis is then extremely short. Starting with:
rf=pd.DataFrame({'Bincenter': bincenters})
the resultframe ("rf") starts with the bincenters (as the x-axis in a plot later), and simply adds columns for the desired results.
With
df[df.Bincenter==xyz]
I can select only those data lines from df, that I want to have in the selected bin.
In my case, I am not interested in the actual time traces, but in the sum or the average, so I use lambda-functions, that run through the rows of rf and searches for every row in df, that has the same value in "Bincenter" there.
rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
With those, another column is added to the resultframe rf for the sum of the traces and the number of lines in the bin.
I performed some fits of the traces in rf.Trace_sum, which I did not in pandas.
Still, the dataframe was very useful here. I used odr for fitting like this
for i in binnumber:
fitdata=odr.Data(time[fitstart:],rf.Trace_sum.values[i][fitstart:])
... some more fit stuff here...
and saved the fitresults in
lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})
and finally added them in the resultframe with
rf=pd.concat([rf,lifetimefits],axis=1)
rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)
which makes an output like
Out[78]:
Bincenter Binsize ... lifetime sd_lifetime
0 139.0 4102 ... 38.492028 2.803211
1 140.0 4252 ... 33.659729 2.534872
2 141.0 3785 ... 31.220312 2.252104
3 142.0 3823 ... 29.391562 1.783890
4 143.0 3808 ... 40.422578 2.849545
I hope, this explanation might help others to not waste time, trying this with numpy. Thanks again to kwinkunks for his very helpful advice to use the pandas DataFrame.
Best, lepakk