If I want to apply deep learning to the sensor dataset I currently have, I will need quite a lot of data, or the model may overfit. Unfortunately, the sensors have only been active for a month, so the data needs augmentation. I currently have the data in a dataframe, shown below:
index timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
Now I want to augment some particular columns with the tsaug package. The augmentation can be in the form of:
my_aug = (
RandomMagnify(max_zoom=1.2, min_zoom=0.8) * 2
+ RandomTimeWarp() * 2
+ RandomJitter(strength=0.1) @ 0.5
+ RandomTrend(min_anchor=-0.5, max_anchor=0.5) @ 0.5
)
The docs for the augmentation library proceed to use the augmentation in the manner below:
X_aug, Y_aug = my_aug.run(X, Y)
Upon further investigation on this site, it seems that the augmentation operates on numpy arrays. While it states that it is a multivariate augmentation, I am not really sure how that works in practice.
I would like to apply this augmentation consistently across the float columns such as cas_pre and fl_rat, so as not to diverge too much from the original data and the relationships between the columns. I would not like to apply it to columns such as timestamp. I am not sure how to do this within Pandas.
This is my attempt:
#Convert Pandas dataframe to Numpy array and apply tsaug transformations
import numpy as np
import pandas as pd
from tsaug import TimeWarp, Crop, Quantize, Drift, Reverse
df = pd.DataFrame({"timestamp": [1, 2],"cas_pre": [687.982849, 693.427673], "fl_rat": [1627.040283, 1506.217285]})
my_aug = (
Drift(max_drift=(0.1, 0.5))
)
aug = my_aug.augment(df[["timestamp","cas_pre","fl_rat"]].to_numpy())
print("Input:")
print(df[["timestamp","cas_pre","fl_rat"]].to_numpy()) #debug
print("Output:")
print(aug)
Console Output:
Input:
[[1.00000000e+00 6.87982849e+02 1.62704028e+03]
[2.00000000e+00 6.93427673e+02 1.50621728e+03]]
Output:
[[1.00000000e+00 9.13389853e+02 2.03588979e+03]
[2.00000000e+00 1.01536282e+03 1.43177109e+03]]
You may need to convert your timestamps to something numeric.
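As a minimal sketch of that conversion, assuming the timestamps are stored as strings like the ones in the question, pandas can parse them and express each one as seconds elapsed since the first sample. The column name t_seconds is only an illustrative choice, not something from the original data.

```python
import pandas as pd

# Small sample shaped like the dataframe in the question.
df = pd.DataFrame({
    "timestamp": ["2017-04-06 11:25:00", "2017-04-06 11:30:00", "2017-04-06 11:35:00"],
    "cas_pre": [687.982849, 693.427673, 692.686310],
    "fl_rat": [1627.040283, 1506.217285, 1537.114807],
})

# Parse the strings into datetime64 values.
ts = pd.to_datetime(df["timestamp"])

# Hypothetical helper column: seconds elapsed since the first sample.
df["t_seconds"] = (ts - ts.iloc[0]).dt.total_seconds()

print(df[["timestamp", "t_seconds"]])
```

Keeping the original timestamp column alongside the numeric one makes it easy to restore readable times after any numeric processing.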
The tsaug functions you use don't seem to exist, so I only applied Drift() as an example. After some experimentation, TimeWarp() doesn't affect the timestamps (column 1) by default, but TimeWarp() * 5 inserts new samples by cloning each timestamp 5 times.
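One caveat worth checking: as far as I can tell from the tsaug documentation, augment() reads a 2-D array as (N series, T time steps) and only a 3-D array as (N, T, C) with C channels. If that holds, the (2, 3) array above was treated as two series of length three rather than three columns. That could also explain why the first column looked unchanged. The sketch below shows one way to augment only the float columns as channels of a single series and leave timestamp untouched. The helper augment_float_columns and the jitter stand-in are hypothetical names written for this example, not tsaug functions. With tsaug installed, my_aug.augment could presumably be passed in place of jitter, provided the augmenter keeps the series length.

```python
import numpy as np
import pandas as pd


def augment_float_columns(df, columns, augment):
    """Apply `augment` to the chosen columns as one multichannel series.

    `augment` is any callable that maps an array of shape (N, T, C)
    to an array of the same shape.
    """
    # (T, C) -> (1, T, C): one series, T time steps, C channels.
    X = df[columns].to_numpy(dtype=float)[np.newaxis, :, :]
    X_aug = np.asarray(augment(X))
    if X_aug.shape != X.shape:
        raise ValueError("augmenter changed the shape; rows no longer align")
    out = df.copy()
    out[columns] = X_aug[0]  # write the augmented channels back
    return out


def jitter(X, scale=0.05, seed=0):
    """Stand-in augmenter (assumption): Gaussian noise scaled per channel."""
    rng = np.random.default_rng(seed)
    sigma = scale * X.std(axis=1, keepdims=True)  # one sigma per channel
    return X + rng.normal(0.0, sigma, size=X.shape)


df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2017-04-06 11:25:00", "2017-04-06 11:30:00", "2017-04-06 11:35:00"]
    ),
    "cas_pre": [687.982849, 693.427673, 692.686310],
    "fl_rat": [1627.040283, 1506.217285, 1537.114807],
})

# Pick up only the float columns; timestamp is datetime and is skipped.
float_cols = df.select_dtypes(include="float").columns.tolist()
df_aug = augment_float_columns(df, float_cols, jitter)
print(df_aug)
```

Augmenters that change the number of time steps (cropping or resizing, for example) would need the timestamp column rebuilt as well, which is why the helper raises an error instead of silently misaligning rows.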