python csv android-sensors resampling data-preprocessing

Splitting a single large csv file to resample by two columns


I am doing a machine learning project with phone sensor data (accelerometer). I need to preprocess the dataset before I feed it to the ML model. I have 25 classes (the letters in the dataset), and each class has 20 subjects (the number of times the letter was recorded). Since the recordings have different lengths for each class and subject, I have to resample them. I want to split a single CSV file by class and subject so that I can resample each part. I have tried groupby() and a few other things, but they did not work. I would be glad to hear any thoughts on how to approach this. This is my first question on this site, so please let me know if I have made any mistakes. Thanks in advance.

I have included some code and output to make my question clearer.

This is what I got when I tried groupby(), which is not exactly what I wanted.

This is what my CSV file looks like. It contains more than 300,000 rows.

A code snippet:

import pandas as pd
import numpy as np

def read_data(file_path):
    data = pd.read_csv(file_path)
    return data

# read csv file
dataset = read_data('raw_data.csv')

df1 = pd.DataFrame(dataset.groupby(['alphabet', 'subject'])['x_axis'].count())
df1['x_axis'].head(20)

I also need to do this for x_axis, y_axis and z_axis, so what can I use other than groupby()? I need not only the lengths but also the values of all three axes to be able to resample.


Solution

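  • First, find the size of the smallest (alphabet, subject) group. This is the largest number of samples that every group can supply:

    num_sample = dataset.groupby(['alphabet', 'subject'])['x_axis'].count().min()

    Now draw that many rows from each group. The result keeps every column, including x_axis, y_axis and z_axis. The rows are drawn at random, so call sort_index() afterwards if the original time order matters:

    dataset.groupby(['alphabet', 'subject']).sample(n=num_sample).sort_index()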
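
To see the whole idea end to end, here is a minimal sketch. It uses a small synthetic DataFrame as a stand-in for raw_data.csv; the column names come from the question, but the group lengths, the random values, the fixed random_state and the final array layout are assumptions made for illustration. It also assumes that the rows of each recording appear in chronological order in the file and that pandas 1.1 or newer is installed, since that is when groupby(...).sample() was added.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for raw_data.csv (hypothetical lengths and values).
rng = np.random.default_rng(0)
rows = []
for letter, subject, length in [("A", 1, 5), ("A", 2, 7), ("B", 1, 4), ("B", 2, 6)]:
    for _ in range(length):
        rows.append((letter, subject, *rng.normal(size=3)))
dataset = pd.DataFrame(
    rows, columns=["alphabet", "subject", "x_axis", "y_axis", "z_axis"]
)

keys = ["alphabet", "subject"]
axes = ["x_axis", "y_axis", "z_axis"]

# Length of the shortest recording: every group can supply this many rows.
num_sample = int(dataset.groupby(keys).size().min())

# Draw num_sample rows per group, then restore the original row order so
# each recording stays in time order (assumes the file is chronological).
resampled = (
    dataset.groupby(keys)
    .sample(n=num_sample, random_state=0)
    .sort_index()
)

# Split into one DataFrame per (alphabet, subject) pair, keeping all three axes.
pieces = {key: group[axes] for key, group in resampled.groupby(keys)}

# Stack into an array of shape (recordings, num_sample, 3) plus one label each.
X = np.stack([piece.to_numpy() for piece in pieces.values()])
labels = [key[0] for key in pieces]
```

Random row selection changes the spacing between samples within a recording. If even spacing matters for the model, interpolating each recording to a fixed length (for example with numpy.interp) may be a better fit, but that is a different technique from the one in the answer above.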