Python: "Binning" subarrays

I want to bin rows of data according to the first element of each row.

My data has this shape:

[[Temperature, value0, value1, ... value249]
 [Temperature, ...
]

That is, the first element of each row is a temperature value, and the rest of the row is a time trace of a signal.

I would like to make an array of this shape:

[Temperature-bin,[[values]
                  [values]
                     ... ]]
 Next Temp.-bin, [[values]
                  [values]
                     ... ]]
...
]

where each row of the original data array is sorted into the subarray of its temperature bin.

data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]

start=23000
end=380000

tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])

binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)

binneddata=np.empty([len(bincenters),2])

for i in np.arange(len(temp)):
    binneddata[i]=[bincenters[i],np.array([])]

I was hoping to get a result array as described above, where every row consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Instead, Python raises an error: "setting an array element with a sequence". I have created this kind of mixed-type array in another script before, but there I had to define it explicitly, which is not possible here because I am handling files with several hundred thousand lines of data. I would also like to use built-in functions wherever possible and as few loops as I can, because my computer already takes some time to process files of that size.

Thank you for your input,

lepakk

Solution

First: Thanks to kwinkunks for the hint of using a pandas dataframe. I found a solution using this feature.

The binning is now done like this:

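import numpy as np
import pandas as pd

tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])

# bin centers at whole degrees, with edges half a bin width to either side
binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
lowerbinedges=bincenters-binsize/2
higherbinedges=bincenters+binsize/2

# n bins need n+1 edges: all lower edges plus the last upper edge
allbinedges=np.append(lowerbinedges,higherbinedges[-1])

temp_pd=pd.Series(temp[start:end])
# one Series element per row, each holding a whole time trace
traces=pd.Series(list(data[start:end,0:250]))

# label every row with the center of the bin its temperature falls into
tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)

df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)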

The bins are defined explicitly (here with equal widths). The variable "tempbins" has the same shape as temp (the "raw" temperature) and assigns every row of data to a bin.
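
A minimal sketch of the same edge and label construction on a handful of made-up temperatures (the six values are hypothetical, chosen only to show which bin each one lands in):

```python
import numpy as np
import pandas as pd

# Hypothetical toy temperatures; the real data has several 100K rows.
temp = np.array([138.6, 139.2, 139.4, 140.1, 140.49, 141.3])

binsize = 1
bincenters = np.arange(np.round(temp.min()), np.round(temp.max()) + 1, binsize)

# One more edge than there are centers: each center -/+ half a bin width.
edges = np.append(bincenters - binsize / 2, bincenters[-1] + binsize / 2)

# pd.cut uses right-closed intervals (lower, upper] by default and
# attaches the given label to every value that falls into an interval.
tempbins = pd.cut(pd.Series(temp), edges, labels=bincenters)

print(tempbins.tolist())
```

Because the intervals are open on the left by default, a temperature lying exactly on the lowest edge would get no bin (NaN) unless include_lowest=True is passed to pd.cut.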

The actual analysis is then extremely short. Starting with:

rf=pd.DataFrame({'Bincenter': bincenters})

the result frame ("rf") starts with the bin centers (the x-axis of a later plot); columns for the desired results are then simply added to it.

With

df[df.Bincenter==xyz]

I can select only those rows of df that belong to a given bin.

In my case, I am not interested in the individual time traces but in their sum or average, so I use lambda functions that run through the rows of rf and, for each one, look up every row in df that has the same value in "Bincenter".

rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)

These add two columns to the result frame rf: the number of rows in each bin and the sum of its traces.
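
The row-wise apply above scans df once per bin. A different technique, sketched here only as a possible alternative and not as the method used in this answer, is to spread the traces over one column per time step and let groupby compute counts and sums in a single pass. The toy arrays and the names wide, counts and trace_sum are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 traces of 3 time steps each, already assigned to bins.
bincenter = np.array([139.0, 139.0, 140.0, 141.0, 141.0])
traces = np.arange(15, dtype=float).reshape(5, 3)

# One column per time step, so all traces of a bin can be summed at once.
wide = pd.DataFrame(traces)
grouped = wide.groupby(bincenter)

counts = grouped.size()    # number of traces per bin
trace_sum = grouped.sum()  # summed trace per bin, one row per bin center

print(counts)
print(trace_sum)
```

If the grouping key is the categorical column produced by pd.cut, whether empty bins appear in the result depends on the pandas version and on the observed argument of groupby.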

I then fitted the traces in rf.Trace_sum, which I did outside pandas.

Still, the dataframe was very useful here. I used odr for fitting like this:

for i in binnumber:
    fitdata=odr.Data(time[fitstart:],rf.Trace_sum.values[i][fitstart:])
    ... some more fit stuff here...

and saved the fit results in

lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})

and finally added them in the resultframe with

rf=pd.concat([rf,lifetimefits],axis=1)
rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)

which makes an output like

Out[78]: 
    Bincenter  Binsize  ...   lifetime  sd_lifetime
0       139.0     4102  ...  38.492028     2.803211
1       140.0     4252  ...  33.659729     2.534872
2       141.0     3785  ...  31.220312     2.252104
3       142.0     3823  ...  29.391562     1.783890
4       143.0     3808  ...  40.422578     2.849545

I hope this explanation helps others avoid wasting time trying this with numpy alone. Thanks again to kwinkunks for his very helpful advice to use the pandas DataFrame.

Best, lepakk