Tags: python, pandas, numpy, dataframe, sub-array

Python: "Binning" subarrays


I want to bin lines of data according to the first element of each line.

My data has this shape:

[[Temperature, value0, value1, ... value249]
 [Temperature, ...
]

That is, the first element of each line is a temperature value, and the rest of the line is a time trace of a signal.

I would like to make an array of this shape:

[[Temperature-bin, [[values]
                    [values]
                       ... ]]
 [Next Temp.-bin,  [[values]
                    [values]
                       ... ]]
 ...
]

where each line of the original data array is sorted into the subarray of its temperature bin.

data= np.array([values]) # shape is [temp+250 timesteps,400K]
temp=data[0]

start=23000
end=380000

tempmin=np.min(temp[start:end])
tempmax=np.max(temp[start:end])

binsize=1
bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)

binneddata=np.empty([len(bincenters),2])

for i in np.arange(len(temp)):
    binneddata[i]=[bincenters[i],np.array([])]

I was hoping to get a result array as described above, where every line consists of the mean temperature of the bin (bincenters[i]) and an array of time traces. Instead, Python gives me an error about "setting an array element with a sequence". I have created this kind of array, consisting of different data types, in another script before, but there I had to define it explicitly, which is not possible here because I'm handling files with several hundred thousand lines of data. At the same time, I would like to use as many built-in functions and as few loops as possible, because my computer already takes some time to process files of that size.

Thank you for your input,

lepakk


Solution

  • First: thanks to kwinkunks for the hint to use a pandas DataFrame. I found a solution using it.

    The binning is now done like this:

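    import numpy as np
    import pandas as pd
    
    # temp, data, start and end are defined as in the question
    tempmin=np.min(temp[start:end])
    tempmax=np.max(temp[start:end])
    
    binsize=1
    bincenters=np.arange(np.round(tempmin),np.round(tempmax)+1,binsize)
    # bin edges lie half a bin width below and above each center
    lowerbinedges=bincenters-binsize/2
    higherbinedges=bincenters+binsize/2
    
    allbinedges=np.append(lowerbinedges,higherbinedges[-1])
    
    temp_pd=pd.Series(temp[start:end])
    # one time trace (array) per row; if column 0 of data holds the
    # temperature, the slice may need to be 1:251 instead
    traces=pd.Series(list(data[start:end,0:250]))
    
    # label every temperature with the center of the bin it falls into
    tempbins=pd.cut(temp_pd,allbinedges,labels=bincenters)
    
    df=pd.concat([temp_pd,tempbins,traces], keys=['Temp','Bincenter','Traces'], axis=1)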
    

    This defines the bins (equal-sized in this case). The variable "tempbins" has the same shape as temp (the "raw" temperature) and assigns every line of data to a bin.
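
    To make the edge and label logic concrete, here is a minimal sketch on a handful of made-up temperatures (the values are hypothetical and only stand in for temp[start:end]):

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures standing in for temp[start:end].
temp = np.array([139.2, 139.8, 140.4, 141.0, 141.49, 140.6])

binsize = 1
bincenters = np.arange(np.round(temp.min()), np.round(temp.max()) + 1, binsize)
# Edges sit half a bin width on either side of each center.
allbinedges = np.append(bincenters - binsize / 2, bincenters[-1] + binsize / 2)

# Each temperature is labelled with the center of the bin it falls into.
tempbins = pd.cut(pd.Series(temp), allbinedges, labels=bincenters)
print(tempbins.tolist())
```

    Note that pd.cut uses right-closed intervals by default, so a value lying exactly on an edge is assigned to the lower bin, and a value equal to the lowest edge gets no bin at all (NaN).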

    The actual analysis is then extremely short. Starting with:

    rf=pd.DataFrame({'Bincenter': bincenters})
    

    The result frame ("rf") starts with the bin centers (used later as the x-axis of a plot); columns for the desired results are then simply added to it.

    With

    df[df.Bincenter==xyz] 
    

    I can select only those data lines from df that belong to the chosen bin.
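
    As a small illustration of this selection, assuming a toy frame with the same column names (the numbers are invented, and Bincenter is a plain float column here rather than the categorical that pd.cut returns):

```python
import pandas as pd

# Toy stand-in for df; real traces would hold 250 time steps each.
df = pd.DataFrame({
    "Temp": [139.2, 139.8, 140.4, 141.0],
    "Bincenter": [139.0, 140.0, 140.0, 141.0],
    "Traces": [[1, 2], [3, 4], [5, 6], [7, 8]],
})

# The comparison yields a boolean mask; indexing with it keeps matching rows.
selected = df[df.Bincenter == 140.0]
print(selected.Temp.tolist())
```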

    In my case, I am not interested in the individual time traces but in their sum or average, so I use lambda functions that run through the rows of rf and, for each row, look up every row in df that has the same value in "Bincenter".

    rf['Binsize']=rf.apply(lambda row: len(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
    rf['Trace_sum']=rf.apply(lambda row: sum(df.Traces[df.Bincenter==row.Bincenter]), axis=1)
    

    These add two columns to the result frame rf: the number of lines in each bin and the sum of the traces in it.
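
    The same two columns could probably also be built with a groupby, which walks through df once instead of rescanning it for every bin. This is a different technique from the apply/lambda approach above, shown here only as a sketch on invented toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df with one short trace (array) per row.
df = pd.DataFrame({
    "Bincenter": [139.0, 140.0, 140.0, 141.0],
    "Traces": [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
               np.array([5.0, 6.0]), np.array([7.0, 8.0])],
})

rows = []
# Iterating over the grouped column yields (bin center, traces in that bin).
for center, traces in df.groupby("Bincenter")["Traces"]:
    rows.append({
        "Bincenter": center,
        "Binsize": len(traces),
        # Stack the traces into a 2-D array and sum over the rows.
        "Trace_sum": np.sum(np.stack(traces.to_list()), axis=0),
    })
rf = pd.DataFrame(rows)
print(rf[["Bincenter", "Binsize"]])
```

    Unlike the version that starts from the full list of bin centers, this sketch only produces rows for bins that actually contain data.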

    I then performed some fits of the traces in rf.Trace_sum, which I did not do in pandas.

    Still, the dataframe was very useful here. I used odr for fitting like this

    for i in binnumber:
        fitdata=odr.Data(time[fitstart:],rf.Trace_sum.values[i][fitstart:])
        ... some more fit stuff here...
    

    and saved the fit results in

    lifetimefits=pd.DataFrame({'lifetime': fitresult[:,1], 'sd_lifetime':fitresult[:,4]})
    

    and finally added them in the resultframe with

    rf=pd.concat([rf,lifetimefits],axis=1)
    rf[['Bincenter','Binsize','lifetime','sd_lifetime']].to_csv('results.csv', header=True, index=False)
    

    which makes an output like

    Out[78]: 
        Bincenter  Binsize  ...   lifetime  sd_lifetime
    0       139.0     4102  ...  38.492028     2.803211
    1       140.0     4252  ...  33.659729     2.534872
    2       141.0     3785  ...  31.220312     2.252104
    3       142.0     3823  ...  29.391562     1.783890
    4       143.0     3808  ...  40.422578     2.849545
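
    For readers unfamiliar with the fitting step, here is a rough sketch of a single fit. It assumes that odr refers to scipy.odr and that the model is a single exponential decay; the model, the start values, the synthetic trace and the value of fitstart are all assumptions made for illustration, not the fit actually used above.

```python
import numpy as np
from scipy import odr

def decay(beta, t):
    # Assumed model: amplitude * exp(-t / lifetime).
    amplitude, lifetime = beta
    return amplitude * np.exp(-t / lifetime)

# Synthetic trace standing in for one entry of rf.Trace_sum.
rng = np.random.default_rng(0)
time = np.linspace(0.0, 100.0, 250)
trace_sum = decay([5.0, 30.0], time) + rng.normal(0.0, 0.01, time.size)

fitstart = 10  # hypothetical index of the first point included in the fit
fitdata = odr.Data(time[fitstart:], trace_sum[fitstart:])
fit = odr.ODR(fitdata, odr.Model(decay), beta0=[4.0, 25.0]).run()

# Fitted parameters and their standard deviations.
amplitude_fit, lifetime_fit = fit.beta
sd_amplitude, sd_lifetime = fit.sd_beta
print(lifetime_fit, sd_lifetime)
```

    Collecting lifetime_fit and sd_lifetime for every bin would give columns like the ones stored in lifetimefits.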
    

    I hope this explanation helps others avoid wasting time trying to do this with numpy alone. Thanks again to kwinkunks for the very helpful advice to use a pandas DataFrame.

    Best, lepakk