I have two datasets, one with energy consumption and one with energy production. I merged these two and filtered out all of the surplus energy peaks. This resulted in a dataframe with lots of peaks, and zeros at all moments where there is no surplus energy.
What I am looking for is the amount of energy in each peak. More or less this means finding the area under each individual peak, from the moment it rises from zero to the moment it returns to zero.
I tried, unsuccessfully, to split the peaks every time the graph hits zero. I simply have no idea how to code something that will split the dataset into individual peaks, or, for that matter, calculate how much energy there is in each peak. It is too much data to do this by hand (almost a year of data in 15-minute intervals).
Simply summing all the data and dividing by the number of data points will not cut it.
I hope it is clear what I'm trying to achieve here. Thanks
EDIT
Let's say the data looks something like this:
import pandas as pd

df = pd.DataFrame()
df['Value'] = [0, 0, 0, 0, 1, 1, 0, 1, 2, 4, 3, 0, 4, 0, 1, 0, 4, 0, 1, 0]
df['Timestamp'] = pd.date_range(start='1/1/2018', periods=len(df), freq='15min')
df.plot(x='Timestamp', y='Value')
I can't split it on the zeros with partitioning; I cannot find anything online where a dataset is split every time there is a zero. Thanks.
I have something for you, but (a) it may have off-by-one errors, and (b) it needs some manual fudging at the first and last rows of the dataframe if Value isn't zero for those rows. Disclaimers dispensed, here goes.
First, (1) add columns indicating when a row is the beginning of a peak and when it's at the end. At least for me, this involved some fumbling around with the parameters of the shift() calls. (2) Calculate the area under the graph for the 15-minute period represented by each row, using the trapezium rule. (3) Add a column to store the IDs of the peaks later. So far we have:
# (1) Flag the zero row just before each peak (start_peak) and the first
#     zero row after each peak (end_peak); fill_value treats values beyond
#     the ends of the series as zero.
df["start_peak"] = df["Value"].shift(-1, fill_value=0).ne(0) & df["Value"].eq(0)
df["end_peak"] = (df["Value"].shift(-1, fill_value=0).eq(0)
                  & df["Value"].ne(0)).shift(1, fill_value=False)
# (2) Trapezium rule: each row owns the slice between it and the next row,
#     and each 15-minute interval is 0.25 hours wide.
df["Area"] = 0.5 * 0.25 * (df["Value"] + df["Value"].shift(-1, fill_value=0))
# (3) Placeholder column for the peak IDs, which will be timestamps.
df["peak_id"] = pd.NaT
Now we need to loop through the rows of the dataframe and assign a peak ID to each row. The logic I chose was: (4) if it's the start of a peak, then the ID of the peak is the timestamp; (5) if it's not in a peak, then the ID is undefined (pd.NaT); and (6) otherwise the ID is the timestamp from the beginning of the peak. Note that there are a few ways to iterate down the rows of a dataframe (iterrows, iteritems, itertuples), but in general you should avoid iterating when you can. I don't think we can avoid it here, but see the vectorized sketch after the loop for a possible alternative.
previous_peak_id = pd.NaT
for i, row in df.iterrows():
    if row["start_peak"]:  # (4) start of a peak: its ID is this timestamp
        # Just setting row["peak_id"] would not affect the main dataframe.
        df.loc[i, "peak_id"] = row["Timestamp"]
        previous_peak_id = row["Timestamp"]
    elif not row["end_peak"]:  # (6) inside a peak: reuse the ID of its start
        df.loc[i, "peak_id"] = previous_peak_id
    else:
        pass  # (5) end of a peak: keep the pd.NaT assigned in step (3)
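If iterating over a whole year of rows ever proves too slow, here is a sketch of a vectorized alternative; it is an aside rather than part of the recipe, and it assumes the same start_peak and end_peak columns as above:

# Carry each peak's starting timestamp forward, then blank out the rows
# flagged as peak ends; rows before the first peak stay NaT after ffill().
df["peak_id"] = df["Timestamp"].where(df["start_peak"]).ffill()
df.loc[df["end_peak"] & ~df["start_peak"], "peak_id"] = pd.NaT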
Finally, we group the rows of the dataframe by peak_id and sum the Area within each group. This adds up the trapezium slices carrying the same peak_id, thus giving the total area under each peak.
# Rows whose peak_id is NaT (i.e. not part of any peak) are dropped by groupby.
df.groupby("peak_id")["Area"].sum()
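On the sample frame from the question, and with the fill_value edge handling above, this should come out to something like:

peak_id
2018-01-01 00:45:00    0.50
2018-01-01 01:30:00    2.50
2018-01-01 02:45:00    1.00
2018-01-01 03:15:00    0.25
2018-01-01 03:45:00    1.00
2018-01-01 04:15:00    0.25
Name: Area, dtype: float64

Each peak is labelled by the timestamp of the zero row just before it rises, and the six sums add up to 5.5, which matches 0.25 * df["Value"].sum() as a cross-check (exact here because every peak starts and ends at zero).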