I am working on a machine learning project with phone sensor data (accelerometer), and I need to preprocess the dataset before I feed it to the ML model. I have 25 classes (letters of the alphabet in the dataset), and there are 20 subjects (the number of times each letter was recorded) for each class. Since the recordings have different lengths for each class and subject, I have to resample them. To do that, I want to split a single CSV file by class and subject. I have tried groupby() and a few other things, but they did not work. I would be glad to hear any thoughts on how to approach this problem. This is my first time asking a question on this site, so if I have made a mistake, I would appreciate it if you let me know. Thank you in advance.
I am sharing some code and output to make my question clearer.
What I got when I tried groupby(), which is not exactly what I wanted.
This is what my CSV file looks like. It contains more than 300,000 rows.
Some code snippet:
import pandas as pd
import numpy as np

def read_data(file_path):
    data = pd.read_csv(file_path)
    return data

# read csv file
dataset = read_data('raw_data.csv')

# number of rows per (alphabet, subject) group
df1 = pd.DataFrame(dataset.groupby(['alphabet', 'subject'])['x_axis'].count())
df1['x_axis'].head(20)
I also need to do this for each of x_axis, y_axis and z_axis, so is there something other than groupby() that I can use? I do not want only the lengths; I also need the values of all three axes to be able to resample.
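One possible way to get the actual values per class and subject, rather than just the counts, is to iterate over the groupby object, which yields one sub-DataFrame per (alphabet, subject) pair. The sketch below is only an illustration: the small frame stands in for raw_data.csv with made-up values, and the names `axes` and `segments` are hypothetical, not from the original code.

```python
import pandas as pd

# Tiny stand-in for raw_data.csv; the values are made up for illustration.
dataset = pd.DataFrame({
    "alphabet": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "subject":  [1, 1, 1, 2, 2, 1, 1, 1],
    "x_axis":   [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "y_axis":   [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8],
    "z_axis":   [9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8],
})

axes = ["x_axis", "y_axis", "z_axis"]

# Iterating over a groupby yields (key, sub-DataFrame) pairs;
# with two grouping columns the key is an (alphabet, subject) tuple.
segments = {
    key: group[axes].reset_index(drop=True)
    for key, group in dataset.groupby(["alphabet", "subject"])
}

# Each entry holds all three axes for one recording.
for (letter, subject), segment in segments.items():
    print(letter, subject, segment.shape)
```

Each value in `segments` can then be resampled on its own, since it keeps the x, y and z values of one recording together.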
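First, find the largest number of samples that every group can supply, which is the size of the smallest (alphabet, subject) group. Here df is the DataFrame read from your CSV file (dataset in the question):
num_sample = df.groupby(['alphabet', 'subject'])['x_axis'].count().min()
Now you can draw that many rows from each group. All columns are kept, so x_axis, y_axis and z_axis are sampled together:
df.groupby(['alphabet', 'subject']).sample(num_sample)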
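The two steps above can be tried on a small made-up frame, as in the sketch below. Note that sample() picks rows at random, so the rows of a group will generally not come back in their original order. Sorting by index afterwards is my own addition, on the assumption that the original row order is the time order and should be preserved; `random_state` and the name `resampled` are likewise only there for illustration. GroupBy.sample requires pandas 1.1 or newer.

```python
import pandas as pd

# Made-up data standing in for the real CSV file.
df = pd.DataFrame({
    "alphabet": ["A", "A", "A", "A", "A", "B", "B", "B"],
    "subject":  [1, 1, 1, 2, 2, 1, 1, 1],
    "x_axis":   [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "y_axis":   [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8],
    "z_axis":   [9.1, 9.2, 9.3, 9.4, 9.5, 9.6, 9.7, 9.8],
})

keys = ["alphabet", "subject"]

# Size of the smallest group: every group has at least this many rows.
num_sample = int(df.groupby(keys)["x_axis"].count().min())

# Draw num_sample rows per group without replacement, then restore
# the original row order (assumed here to be the time order).
resampled = (
    df.groupby(keys)
      .sample(n=num_sample, random_state=0)
      .sort_index()
)

# Every (alphabet, subject) group now has the same length.
print(resampled.groupby(keys).size())
```

After this, every group has exactly `num_sample` rows, and each kept row still carries its x_axis, y_axis and z_axis values together.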