
Python: "Binning" subarrays

I would like to bin lines of data according to the first element of each line.

My data has this shape:

[[Temperature, value0, value1, ... value249]
 [Temperature, ...

In other words: the first element of each line is a temperature value, and the rest of the line is a time trace of a signal.

I would like to make an array of this shape:

                     ... ]]
 Next Temp.-bin, [[values]
                     ... ]]

where the lines from the original data array are sorted into the subarray of their respective temperature bin.

data= np.array([values]) # shape is [temp+250 timesteps,400K]





for i in np.arange(len(temp)):

I was hoping to get a result array as described above, where every line consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Python gives me an error regarding "setting an array element with a sequence". I was able to create this kind of array, consisting of different data types, in another script before, but there I had to define it explicitly, which is not possible in this case because I'm handling files on the scale of several 100K lines of data. At the same time, I would like to use as many built-in functions and as few loops as possible, because my computer is already taking some time to process files of that size.

Thank you for your input,



  • First: Thanks to kwinkunks for the hint of using a pandas dataframe. I found a solution using this feature.

    The binning is now done like this:

    df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)

    by defining bins (in this case even-sized). The variable "tempbins" has the same shape as temp (the "raw" temperature) and assigns every line of data to a certain bin.
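
    The definition of the bins and of "tempbins" is not shown above. A minimal sketch of how it might look, assuming bins of width 1 K with edges at half-integers and using np.digitize (the sample data, "edges", "binwidth" and "bin_index" are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 6 lines, each a temperature followed by a
# 4-step trace (the real data has 250 time steps per line).
data = np.array([
    [138.6, 1.0, 2.0, 3.0, 4.0],
    [139.2, 1.0, 1.0, 1.0, 1.0],
    [139.9, 2.0, 2.0, 2.0, 2.0],
    [140.4, 0.0, 1.0, 0.0, 1.0],
    [141.3, 5.0, 5.0, 5.0, 5.0],
    [140.6, 3.0, 3.0, 3.0, 3.0],
])
temp = data[:, 0]

# Even-sized bins (assumed width: 1 K). Edges sit at half-integers so that
# the bin centers fall on whole numbers.
binwidth = 1.0
edges = np.arange(138.5, 141.5 + binwidth, binwidth)
bincenters = 0.5 * (edges[:-1] + edges[1:])

# np.digitize gives, for each temperature t, the index i with
# edges[i-1] <= t < edges[i]; subtracting 1 turns it into a bin index.
# Temperatures outside the edges would need to be filtered out first.
bin_index = np.digitize(temp, edges) - 1

temp_pd = pd.Series(temp)
tempbins = pd.Series(bincenters[bin_index])
# One object-dtype Series that holds each line's trace as its own array.
traces = pd.Series(list(data[:, 1:]))

df = pd.concat([temp_pd, tempbins, traces],
               keys=['Temp', 'Bincenter', 'Traces'], axis=1)
```

    Here every line of df carries the center of the bin it falls into, so lines with equal "Bincenter" belong together.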

    The actual analysis is then extremely short. Starting with:

    rf=pd.DataFrame({'Bincenter': bincenters})

    the resultframe ("rf") starts with the bincenters (as the x-axis in a plot later), and simply adds columns for the desired results.



    I can select only those data lines from df that belong in the selected bin.

    In my case, I am not interested in the actual time traces, but in their sum or average, so I use lambda functions that run through the rows of rf and, for each row, look up every row in df that has the same value in "Bincenter".

    rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
    rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)

    With those, two more columns are added to the resultframe rf: the number of lines in the bin and the sum of the traces.
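
    As a possible alternative that I did not use above: the same two columns can probably be built with groupby, which walks through df once instead of building one boolean mask per bin. A small sketch on made-up data (the miniature df and the helper name "sums" are only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df: two bins, traces of 3 time steps each.
df = pd.DataFrame({
    'Bincenter': [139.0, 139.0, 140.0],
    'Traces': [np.array([1.0, 2.0, 3.0]),
               np.array([1.0, 1.0, 1.0]),
               np.array([4.0, 4.0, 4.0])],
})

# Iterating over the grouped column yields (bin center, Series of traces).
# Stacking the traces of one bin and summing over axis 0 gives the
# element-wise sum of all traces in that bin.
sums = {center: np.sum(group.to_list(), axis=0)
        for center, group in df.groupby('Bincenter')['Traces']}

# groupby sorts the keys by default, so both results share the same order.
rf = pd.DataFrame({'Bincenter': list(sums.keys())})
rf['Binsize'] = df.groupby('Bincenter').size().to_numpy()
rf['Trace_sum'] = list(sums.values())
```

    For an average instead of a sum, np.mean could replace np.sum in the same place.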

    I performed some fits of the traces in rf.Trace_sum, which I did not do in pandas.

    Still, the dataframe was very useful here. I used odr for fitting like this

    for i in binnumber:
        ... some more fit stuff here...

    and saved the fitresults in

    lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})

    and finally added them to the resultframe, which is then written out with

    rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)

    which makes an output like

        Bincenter  Binsize  ...   lifetime  sd_lifetime
    0       139.0     4102  ...  38.492028     2.803211
    1       140.0     4252  ...  33.659729     2.534872
    2       141.0     3785  ...  31.220312     2.252104
    3       142.0     3823  ...  29.391562     1.783890
    4       143.0     3808  ...  40.422578     2.849545
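
    The line that attaches the fit results to rf is not shown above. One way it might be done is a column-wise pd.concat, sketched here with the first three rows of the output above. The layout of "fitresult" (lifetime in column 1, its standard deviation in column 4) is taken from the DataFrame line above; the remaining columns are unknown and left as NaN:

```python
import numpy as np
import pandas as pd

# First three bins from the output above.
rf = pd.DataFrame({'Bincenter': [139.0, 140.0, 141.0],
                   'Binsize': [4102, 4252, 3785]})

# Stand-in for the fit output: one row per bin, five columns.
# Only columns 1 and 4 are filled, the rest is unknown here.
fitresult = np.full((3, 5), np.nan)
fitresult[:, 1] = [38.492028, 33.659729, 31.220312]
fitresult[:, 4] = [2.803211, 2.534872, 2.252104]

lifetimefits = pd.DataFrame({'lifetime': fitresult[:, 1],
                             'sd_lifetime': fitresult[:, 4]})

# Both frames use the default 0..n-1 index, so axis=1 lines them up per bin.
rf = pd.concat([rf, lifetimefits], axis=1)

# Without a path, to_csv returns the CSV text instead of writing a file.
csv_text = rf[['Bincenter', 'Binsize', 'lifetime', 'sd_lifetime']].to_csv(
    header=True, index=False)
```

    This only works as intended if rf and lifetimefits have the same number of rows in the same bin order.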

    I hope this explanation helps others avoid wasting time trying this with numpy. Thanks again to kwinkunks for his very helpful advice to use the pandas DataFrame.

    Best, lepakk