Tags: python, tensorflow, machine-learning, artificial-intelligence

Dividing a big dataset in Python


My dataset's features have shape (80102, 2592) and the labels have shape (80102, 2). I want to use only a few rows for training because training the CNN model takes a long time. How can I split the dataset in Python and keep only a few rows for both training and testing?


Solution

  • If your data is in the form of arrays, let X be the array containing the data and y be the array containing the labels. You can use sklearn's train_test_split function to create a smaller sample of the data, as in the code below.

    from sklearn.model_selection import train_test_split

    percent = .1  # fraction of the data you want to use, in this case 10%
    X_data, X_dummy, y_labels, y_dummy = train_test_split(
        X, y, train_size=percent, random_state=123, shuffle=True)


    X_data will contain a shuffled 10% of the original data and y_labels the corresponding 10% of the labels. If you want to set the number of samples exactly, pass an integer to train_size instead of a fraction (a sketch of that is at the end of this answer). If you need further information, the train_test_split documentation is located here.

    If your data is a pandas DataFrame, you can use the pandas function pandas.DataFrame.sample instead; documentation for that is here. Assume your DataFrame is called data. The code below will produce a new DataFrame with a specified fraction of the original rows.

    percent = .1
    new_data = data.sample(n=None, frac=percent, replace=False, weights=None,
                           random_state=123, axis=0)
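
    As a minimal sketch of the integer train_size approach mentioned above, assuming X and y are NumPy arrays like the question's features and labels (the shapes and row counts below are illustrative stand-ins, not from the question's data):

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Stand-in data with the question's row count but a narrow feature axis
    # so the example stays light; replace with your real X and y.
    X = np.random.rand(80102, 16)
    y = np.random.randint(0, 2, size=(80102, 2))

    # An integer train_size keeps exactly 6000 rows; the rest is discarded.
    X_keep, _, y_keep, _ = train_test_split(
        X, y, train_size=6000, random_state=123, shuffle=True)

    # Split the kept rows into 5000 for training and 1000 for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        X_keep, y_keep, test_size=1000, random_state=123, shuffle=True)

    print(X_train.shape, y_train.shape)  # (5000, 16) (5000, 2)
    print(X_test.shape, y_test.shape)    # (1000, 16) (1000, 2)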