For the neural network, I need to separate my data into training and testing sets, which I do using train_test_split from sklearn.
Hopefully the steps below make the problem clear.
1.) I have multiple data files: df1, df2, df3, df4, df5, ... (ROOT files)
2.) I want to combine these data files into one large data file
3.) I need to split the large data file into training and testing data
4.) I need to ensure that the training data is taken equally from each input data file. I do not want the training data to be, say, 10% from df5, 20% from df4, 30% from df3, 1% from df2 and 39% from df1. Instead, I want the same 20% share of each of df1, df2, df3, df4 and df5 to go into the training data.
5.) train_test_split from sklearn will not make this split evenly, so I wrote the for loop shown below, which causes issues.
6.) I need help coming up with another route to extract the training and testing data from the data files because my for loop is not working
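One possible route for points 2 to 5, sketched under assumptions: combine the files first, keep an array recording which file each row came from, and pass that array to the `stratify` parameter of `train_test_split`. The arrays below are random stand-ins with made-up sizes, and the names `files_X`, `files_y` and `file_id` are hypothetical, not taken from the code in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for the per-file arrays (sizes are made up for illustration).
files_X = [rng.normal(size=(100, 6)), rng.normal(size=(50, 6))]
files_y = [rng.integers(0, 2, size=100), rng.integers(0, 2, size=50)]

# Combine everything, remembering which file each row came from.
X = np.concatenate(files_X, axis=0)
y = np.concatenate(files_y, axis=0)
file_id = np.concatenate([np.full(len(x), i) for i, x in enumerate(files_X)])

# One split, stratified on the file id, so every file contributes
# (approximately) the same fraction of its rows to the training set.
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, y, file_id, train_size=0.2, stratify=file_id, random_state=0
)

# Number of training rows that came from each file.
train_counts = np.bincount(id_train, minlength=len(files_X))
print(train_counts)
```

With this approach the result is already a single 2-D array per split, so no list of per-file arrays has to be merged afterwards.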
What I have done so far
Step 1: Created list for training and testing
Input_Train = []
Input_Test = []
Output_Train = []
Output_Test = []
Step 2: (done in a for loop over the files) Open each individual file and extract the data
Step 3: (still in the for loop) Using sklearn, split the data into training and testing for each individual data file
InputFeature_train, InputFeature_test, OutputLabel_train, OutputLabel_test = train_test_split( input_features, output_Labels, train_size = 0.2)
Store the output into the list created in step 1 above
Input_Train.append(InputFeature_train)
Input_Test.append(InputFeature_test)
Output_Train.append(OutputLabel_train)
Output_Test.append(OutputLabel_test)
At this point I have all of the data stored in lists
Step 4: (Outside of the for loop) Convert the lists into numpy arrays
Input_Train = np.asarray(Input_Train)
Input_Test = np.asarray(Input_Test)
Output_Train = np.asarray(Output_Train)
Output_Test = np.asarray(Output_Test)
Step 5: When I print out the shape (Input_Train.shape) I get
Input train shape (2,) --> Not what I need
Step 6: After some googling, I found that someone else had a similar issue: "ValueError: cannot select an axis to squeeze out which has size not equal to one"
Since my dimension will be more than 1, I cannot squeeze (or reshape) this numpy array (which might have a dimension of 10+)
But following the comments there, I checked
Input_Train[0].shape --> I got the right values, but for just one data file
Step 7: My question: does anyone have a solution for gathering the training and testing data evenly from multiple samples? The for loop I currently have set up does not work.
Step 8: Thank you in advance for any help. If you can find a link that can help me please let me know and my apologies for not finding the link myself.
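On the shape (2,) from Step 5 and the squeeze error from Step 6, a small sketch may help. The outer array is one-dimensional and holds two separate 2-D arrays with different row counts, so there is no size-one axis to squeeze. Joining the pieces along axis 0 with `np.concatenate` is one way to get a single 2-D array. The arrays here are dummy stand-ins whose row counts are taken from the shapes reported in this post, and the object array is built explicitly because recent NumPy versions may refuse to create such ragged arrays implicitly.

```python
import numpy as np

# Dummy stand-ins for the per-file training arrays.
per_file = [np.zeros((1423, 6)), np.ones((1408, 6))]

# A 1-D object array whose two elements are themselves 2-D arrays.
ragged = np.empty(len(per_file), dtype=object)
for i, arr in enumerate(per_file):
    ragged[i] = arr
print(ragged.shape)     # (2,)
print(ragged[0].shape)  # (1423, 6)

# Joining the per-file arrays row-wise gives one regular 2-D array.
stacked = np.concatenate(per_file, axis=0)
print(stacked.shape)    # (2831, 6)
```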
Update:
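import os
import numpy as np
import ROOT
from root_numpy import tree2array
from sklearn.model_selection import train_test_split

#Branch names to read from the ROOT file
tagNames = ["m1_pt", "m2_pt", "m1_eta", "m2_eta", "m1_phi", "m2_phi"]
#Lists that collect the per-file splits
Input_Train = []
Input_Test = []
Output_Train = []
Output_Test = []
data_file = ["data_file1.root", "data_file2.root"]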
for df in range(len(data_file)):
    #Check to make sure the file is available
    if os.path.exists(data_file[df]):
        #Open the ROOT file
        rfile = ROOT.TFile.Open(data_file[df])
        print("Successfully opened data file")
    else:
        #File is missing: report it and skip to the next one
        print("File not found, please try another file: ", data_file[df])
        continue
    #Get the tree from the ROOT file
    rootTreeName = "tag"
    intree = rfile.Get(rootTreeName)
    #Keep only entries whose index is not -1 (those are invalid data points)
    Ifp = tree2array(intree, branches=tagNames, selection="index != -1")
    num_from_Ifp = Ifp.size  #Number of selected entries
    #Put the Ifp rows into a list
    input_features = []
    for i in range(num_from_Ifp):
        input_features.append(list(Ifp[i]))
    #Convert the list to an array
    input_features = np.array(input_features)
    #################################
    # Do the same thing for the output labels
    #################################
    #output_values is a list with two values (output_values = [0, 1]). 0 corresponds to the muon in the first position and 1 corresponds to the muon in the second position. There is no particular order at this point.
    Ofp = tree2array(intree, branches=["output_values"], selection="index != -1")
    # The line above (Ofp = ...) tells us which muon is the correct one. It gives the value 1 or 0.
    # Here is the output, just to give you an idea:
    # [(0,) (1,) (1,) ... (1,) (1,) (1,)]
    #Now we need a way to let the NN know which muon is correct.
    #Create an array that identifies the correct muon
    # [ 0, 1 ] --> muon in the first position is correct
    # [ 1, 0 ] --> muon in the second position is correct
    output_Labels = np.zeros((input_features.shape[0], 2))
    for i in range(input_features.shape[0]):
        output_Labels[i, 0] = Ofp[i][0]
        #Second column is the complement, so each row is [0, 1] or [1, 0] as described above
        output_Labels[i, 1] = 1 - Ofp[i][0]
    #Using sklearn, split the data into training and testing for each individual data file
    InputFeature_train, InputFeature_test, OutputLabel_train, OutputLabel_test = train_test_split(input_features, output_Labels, train_size=0.2)
    Input_Train.append(InputFeature_train)
    Input_Test.append(InputFeature_test)
    Output_Train.append(OutputLabel_train)
    Output_Test.append(OutputLabel_test)
#Outside the for loop: convert the lists to numpy arrays
Input_Train = np.asarray(Input_Train)
Input_Test = np.asarray(Input_Test)
Output_Train = np.asarray(Output_Train)
Output_Test = np.asarray(Output_Test)
#Printing the shape and type (note going to the first update will show you the values I get from this statement)
print("Input train shape" , Input_Train.shape)
print("Input train type" , type(Input_Train))
print("Input test shape" , Input_Test.shape)
print("Output train shape" , Output_Train.shape)
print("Ouput train type", type(Output_Train))
print("Output test shape" , Output_Test.shape)
print("Input train [0] shape" , Input_Train[0].shape)
print("Input train [0] type" , type(Input_Train[0]))
print("Input test [0] shape" , Input_Test[0].shape)
print("Output train [0] shape" , Output_Train[0].shape)
print("Ouput train [0] type", type(Output_Train[0]))
print("Output test [0] shape" , Output_Test[0].shape)
print("Input train [1] shape" , Input_Train[1].shape)
print("Input train [1] type" , type(Input_Train[1]))
print("Input test [1] shape" , Input_Test[1].shape)
print("Output train [1] shape" , Output_Train[1].shape)
print("Ouput train [1] type", type(Output_Train[1]))
print("Output test [1] shape" , Output_Test[1].shape)
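As a side note on the two element-by-element loops in the code above, here is a vectorized sketch of the same conversions. It assumes that `tree2array` returns a NumPy structured array with one field per branch, which is consistent with the printed rows in this post. The small `Ifp` and `Ofp` arrays are hand-made stand-ins, and the label convention follows the comments in the code: `[0, 1]` when `output_values` is 0 and `[1, 0]` when it is 1.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

tag_names = ["m1_pt", "m2_pt", "m1_eta", "m2_eta", "m1_phi", "m2_phi"]

# Hand-made stand-ins for the structured arrays read from the tree.
Ifp = np.array(
    [(0.29, 0.14, 1.05, -1.59, 2.76, 0.36),
     (0.39, 0.09, 2.14, 0.30, 2.78, -1.40),
     (0.31, 0.17, -1.43, -1.03, 1.41, -1.69)],
    dtype=[(name, "f4") for name in tag_names],
)
Ofp = np.array([(0,), (1,), (1,)], dtype=[("output_values", "i4")])

# Structured array -> plain 2-D float array, one column per branch.
input_features = structured_to_unstructured(Ifp)

# Two-column labels: first column is the flag, second is its complement.
flag = Ofp["output_values"]
output_labels = np.column_stack([flag, 1 - flag])

print(input_features.shape)  # (3, 6)
print(output_labels)
```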
Update: the printed output, including some of the Ifp rows, is below. I have reduced the number of tagNames for this question. Here are all of the outputs:
#Successfully opened data file --> Data File #1
#Print out of Ifp [(0.29471233, 0.14390038, 1.0457071 , -1.585283 , 2.7644567, 0.3607209)
(0.3863323 , 0.09461627, 2.1371427 , 0.29747197, 2.7828562, -1.3958061)
(0.30622792, 0.17158653, -1.428787 , -1.027902 , 1.4112458, -1.6929731)
...
(0.6137445 , 0.10678114, -0.6032986 , -0.4376499 , -3.0313694, 2.2488282)
(0.39696205, 0.11587278, -0.48610905, -2.0285435 , 1.487525 , -2.8935661)
(0.43279374, 0.26002622, 1.6778533 , 1.8011662 , 1.2196023, -1.2002599)]
#Print out of Ofp [(0,) (1,) (1,) ... (1,) (1,) (1,)]
#Successfully opened data file --> Data file #2
#Print out of Ifp [(0.17516829, 0.15851186, 1.1060914 , -0.04039998, 1.2502128 , 1.2712404)
(0.42542648, 0.04648708, -0.23667681, 2.1146367 , 0.15884383, 0.8808505)
(0.32750598, 0.13000336, -0.78815895, 1.2549461 , -1.8893875 , -2.531463 )
...
(0.38410604, 0.11572407, -0.24459349, -1.3212451 , -2.389494 , -1.6892412)
(0.19714127, 0.10412598, -1.4186419 , 1.8119588 , -0.47204217, 2.0771308)
(0.3643554 , 0.11526036, 0.7342873 , -0.6727072 , -1.534371 , 2.6366332)]
#Print out of Ofp [(1,) (1,) (1,) ... (1,) (0,) (1,)]
Input train shape (2,)
Input train type <class 'numpy.ndarray'>
Input test shape (2,)
Output train shape (2,)
Ouput train type <class 'numpy.ndarray'>
Output test shape (2,)
Input train [0] shape (1423, 6)
Input train [0] type <class 'numpy.ndarray'>
Input test [0] shape (5692, 6)
Output train [0] shape (1423, 2)
Ouput train [0] type <class 'numpy.ndarray'>
Output test [0] shape (5692, 2)
Input train [1] shape (1408, 6)
Input train [1] type <class 'numpy.ndarray'>
Input test [1] shape (5634, 6)
Output train [1] shape (1408, 2)
Ouput train [1] type <class 'numpy.ndarray'>
Output test [1] shape (5634, 2)
Update: I do not get an error, but Input_Train has the wrong shape, as noted under Step 5 and shown in the output above.
Issue --> Input train shape (2,) (I do not want this)
Printing Input_Train[0].shape --> Input train [0] shape (1423, 6)
Printing Input_Train[1].shape --> Input train [1] shape (1408, 6)
What I want is to join these two arrays (Input_Train[0] and Input_Train[1]) so that Input_Train.shape gives (2831, 6),
not (2,).
The solution to my problem is posted below. It was worked out with help from @meyere_mit_ai and Minh-Long Luu, thank you!
A simplified version
#Define lists for Train and Test for both Input and Output
nn_InputFeatures_Train = []
nn_InputFeatures_Test = []
nn_OutputLabels_Train = []
nn_OutputLabels_Test = []
#Inside the for loop over the data files: split each file, then append its rows one by one
InputFeature_train, InputFeature_test, OutputLabel_train, OutputLabel_test = train_test_split(input_features, output_Labels, train_size=0.2)
for e in InputFeature_train:
    nn_InputFeatures_Train.append(list(e))
for f in InputFeature_test:
    nn_InputFeatures_Test.append(list(f))
for g in OutputLabel_train:
    nn_OutputLabels_Train.append(list(g))
for h in OutputLabel_test:
    nn_OutputLabels_Test.append(list(h))
#After the loop: convert the training and testing lists into arrays
nn_InputFeatures_Train = np.asarray(nn_InputFeatures_Train)
nn_InputFeatures_Test = np.asarray(nn_InputFeatures_Test)
nn_OutputLabels_Train = np.asarray(nn_OutputLabels_Train)
nn_OutputLabels_Test = np.asarray(nn_OutputLabels_Test)
#Now the arrays are properly stacked how I want them. If anyone has a better solution, feel free to comment! :)
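A possibly more compact variant of the same idea, offered as a sketch rather than a definitive implementation: split each file on its own, keep the four resulting arrays, and join them once with `np.concatenate` instead of appending row by row. The helper name `split_each_then_stack` and its parameters are hypothetical, and the inputs below are dummy arrays whose row counts are inferred from the train and test shapes printed earlier in this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_each_then_stack(datasets, train_size=0.2, seed=0):
    """Split every (features, labels) pair separately, then join the pieces.

    datasets: list of (features, labels) tuples, one per input file.
    Returns X_train, X_test, y_train, y_test as single 2-D arrays.
    """
    parts = [
        train_test_split(X, y, train_size=train_size, random_state=seed)
        for X, y in datasets
    ]
    # zip(*parts) groups all X_train pieces, all X_test pieces, and so on.
    X_train, X_test, y_train, y_test = (
        np.concatenate(group, axis=0) for group in zip(*parts)
    )
    return X_train, X_test, y_train, y_test


# Dummy stand-ins: 7115 and 7042 rows, 6 features, 2 label columns.
datasets = [
    (np.zeros((7115, 6)), np.zeros((7115, 2))),
    (np.ones((7042, 6)), np.ones((7042, 2))),
]
X_train, X_test, y_train, y_test = split_each_then_stack(datasets)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

With these sizes the training array should come out as (2831, 6), the shape asked for above. The rows stay grouped by file after concatenation, so shuffling the training set before use may be worthwhile if the training code does not already do so.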