Tags: python, tensorflow, keras, neural-network, root-framework

Creating training and testing data for neural network using multiple input data


For the neural network, I need to separate my data into training and testing sets, which I do using train_test_split from sklearn.

Hopefully the steps below make this clear.

1.) I have multiple data files: df1, df2, df3, df4, df5, ... (ROOT files)

2.) I want to combine these data files into one large data file

3.) I need to split the large data file into training and testing data

4.) I need to ensure that the training data is taken equally from each input data file. I do not want the training data to consist of 10% from df5, 20% from df4, 30% from df3, 1% from df2 and 39% from df1. Instead, I want the training data to be drawn evenly from df1, df2, df3, df4 and df5, taking 20% of each file.

5.) train_test_split from sklearn will not make this split evenly, so I created a for loop (shown below), but it causes issues

6.) I need help coming up with another route to extract the training and testing data from the data files because my for loop is not working
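
One possible route for points 2 to 4 is to combine the files first and let train_test_split stratify on the file of origin. The snippet below is only a sketch under assumptions: the array sizes and random values are placeholders, and `file_id` is a hypothetical helper array (not part of the original code) that records which file each row came from. With `stratify=file_id`, each file contributes the same fraction of its rows to the training set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the per-file arrays (sizes are made up).
rng = np.random.default_rng(0)
sizes = [100, 200, 300]
features_per_file = [rng.normal(size=(n, 6)) for n in sizes]
labels_per_file = [rng.integers(0, 2, size=n) for n in sizes]

# Combine all files into one large data set.
X = np.concatenate(features_per_file, axis=0)
y = np.concatenate(labels_per_file, axis=0)

# Hypothetical helper: the index of the file every row came from.
file_id = np.concatenate([np.full(n, i) for i, n in enumerate(sizes)])

# Stratifying on file_id keeps the per-file proportions identical in the
# training and testing sets (20% of each file goes to training here).
X_train, X_test, y_train, y_test, id_train, id_test = train_test_split(
    X, y, file_id, train_size=0.2, stratify=file_id, random_state=0
)

# Number of training rows taken from each file.
train_counts = np.bincount(id_train, minlength=len(sizes))
print(train_counts)
```

Note that this takes the same fraction from each file, not the same number of rows. If the files differ a lot in size and equal row counts are wanted, the per-file arrays would have to be subsampled before combining.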

What have I done so far

Step 1: Created list for training and testing

Input_Train = []
Input_Test = []

Output_Train = []
Output_Test = []

Step 2: (done in a for loop over the files) Open each individual file and extract the data

Step 3: (still in for-loop) Using sklearn split the data into training and testing for each individual data file

InputFeature_train, InputFeature_test, OutputLabel_train, OutputLabel_test = train_test_split( input_features, output_Labels, train_size = 0.2)

Store the output in the lists created in Step 1 above

Input_Train.append(InputFeature_train)
Input_Test.append(InputFeature_test)
Output_Train.append(OutputLabel_train)
Output_Test.append(OutputLabel_test)

At this point I have all of the data stored in lists

Step 4: (Outside of for loop) Convert the lists into numpy arrays

Input_Train = np.asarray(Input_Train)
Input_Test = np.asarray(Input_Test)
Output_Train = np.asarray(Output_Train)
Output_Test = np.asarray(Output_Test)

Step 5: When I print out the shape (Input_Train.shape) I get

Input train shape (2,) --> Not what I need

Step 6: (After some googling) I found that someone else had a similar issue: ValueError: cannot select an axis to squeeze out which has size not equal to one

Since my dimension will be more than 1, I cannot squeeze (or reshape) this numpy array (which might have a dimension of 10+)

But following the comments I did

Input_Train[0].shape --> I got the right values for just one data file

Step 7: My question: Does anyone have a solution for gathering the training and testing data evenly from multiple samples? The for loop that I currently have set up does not work.

Step 8: Thank you in advance for any help. If you can find a link that can help me please let me know and my apologies for not finding the link myself.

Update

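#Imports needed by this script (tree2array is assumed to come from root_numpy)
import os
import numpy as np
import ROOT
from root_numpy import tree2array
from sklearn.model_selection import train_test_split

#Names to get from root file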
tagNames = ["m1_pt", "m2_pt", "m1_eta", "m2_eta", "m1_phi", "m2_phi"]  

#Array list
Input_Train = []
Input_Test = []

Output_Train = []
Output_Test = []


data_file = ["data_file1.root", "data_file2.root"]
for df in range(len(data_file)):
        #Check to make sure file is available 
        if os.path.exists(data_file[df]):
                #Opening ROOT file 
                rfile = ROOT.TFile.Open(data_file[df])
                print("Successfully opened data file")
        else:
                #Failure to open file: skip it so that rfile is never used undefined
                print("File not found, please try another file: ",data_file[df])
                continue

        #Get tree name from root file
        rootTreeName = "tag"

        intree = rfile.Get(rootTreeName)

        #Keeping only the entries whose index is not -1 (index = -1 marks invalid data points)
        Ifp= tree2array(intree, branches = tagNames, selection = "index != -1")
        num_from_Ifp = Ifp.size #Size of the Ifp

        #Putting the Ifp values into a list
        input_features=[]
        for i in range(num_from_Ifp):
          input_features.append(list(Ifp[i]))

        #Converting list to an array
        input_features=np.array(input_features)

        #################################
        # Doing the same thing for the output labels
        #################################
        #output_values takes one of two values (0 or 1). 0 corresponds to the muon in the first position and 1 corresponds to the muon in the second position. There is no particular order at this point.
        Ofp = tree2array(intree, branches = ["output_values"], selection = "index != -1")
        # The line above (Ofp = ...) tells us which muon is the correct one. It gives the value 1 or 0.
        # Here is the output, just to give you an idea:
        # [(0,) (1,) (1,) ... (1,) (1,) (1,)]


        #Now we need to create a way to let the nn know which muon is correct.
        #Creating an array that identifies the correct muon
        # [ 0, 1 ] --> muon in the first position is correct
        # [ 1, 0 ] --> muon in the second position is correct

        output_Labels = np.eye(input_features.shape[0], 2)
        for i in range(input_features.shape[0]):
          output_Labels[i,0] = Ofp[i][0]
          #Second column is the complement, so each row is [0, 1] or [1, 0] as described above
          output_Labels[i,1] = 1 - Ofp[i][0]


        #Using sklearn split the data into training and testing for each individual data file
        InputFeature_train, InputFeature_test, OutputLabel_train, OutputLabel_test = train_test_split( input_features, output_Labels, train_size = 0.2)

        Input_Train.append(InputFeature_train)
        Input_Test.append(InputFeature_test)
        Output_Train.append(OutputLabel_train)
        Output_Test.append(OutputLabel_test)



#converting list to numpy array
Input_Train = np.asarray(Input_Train)
Input_Test  = np.asarray(Input_Test )

Output_Train  = np.asarray(Output_Train)
Output_Test   = np.asarray(Output_Test )


#Printing the shape and type (note going to the first update will show you the values I get from this statement)
print("Input train shape"  , Input_Train.shape)
print("Input train type"  , type(Input_Train))
print("Input test shape"  , Input_Test.shape)
print("Output train shape"  , Output_Train.shape)
print("Ouput train type", type(Output_Train))
print("Output test shape"  , Output_Test.shape)

print("Input train [0] shape"  , Input_Train[0].shape)
print("Input train [0] type"  , type(Input_Train[0]))
print("Input test [0] shape"  , Input_Test[0].shape)
print("Output train [0] shape"  , Output_Train[0].shape)
print("Ouput train [0] type", type(Output_Train[0]))
print("Output test [0] shape"  , Output_Test[0].shape)


print("Input train [1] shape"  , Input_Train[1].shape)
print("Input train [1] type"  , type(Input_Train[1]))
print("Input test [1] shape"  , Input_Test[1].shape)
print("Output train [1] shape"  , Output_Train[1].shape)
print("Ouput train [1] type", type(Output_Train[1]))
print("Output test [1] shape"  , Output_Test[1].shape)

Update: Here are the outputs of some Ifp[i] rows as well. I have reduced the number of tagNames for this question. Here are all of the outputs:

#Successfully opened data file --> Data File #1
Provide output as well - Ifp [(0.29471233, 0.14390038,  1.0457071 , -1.585283  ,  2.7644567,  0.3607209)
 (0.3863323 , 0.09461627,  2.1371427 ,  0.29747197,  2.7828562, -1.3958061)
 (0.30622792, 0.17158653, -1.428787  , -1.027902  ,  1.4112458, -1.6929731)
 ...
 (0.6137445 , 0.10678114, -0.6032986 , -0.4376499 , -3.0313694,  2.2488282)
 (0.39696205, 0.11587278, -0.48610905, -2.0285435 ,  1.487525 , -2.8935661)
 (0.43279374, 0.26002622,  1.6778533 ,  1.8011662 ,  1.2196023, -1.2002599)]
#Print out of Ofp [(0,) (1,) (1,) ... (1,) (1,) (1,)]

#Successfully opened data file --> Data file #2
Provide output as well - Ifp [(0.17516829, 0.15851186,  1.1060914 , -0.04039998,  1.2502128 ,  1.2712404)
 (0.42542648, 0.04648708, -0.23667681,  2.1146367 ,  0.15884383,  0.8808505)
 (0.32750598, 0.13000336, -0.78815895,  1.2549461 , -1.8893875 , -2.531463 )
 ...
 (0.38410604, 0.11572407, -0.24459349, -1.3212451 , -2.389494  , -1.6892412)
 (0.19714127, 0.10412598, -1.4186419 ,  1.8119588 , -0.47204217,  2.0771308)
 (0.3643554 , 0.11526036,  0.7342873 , -0.6727072 , -1.534371  ,  2.6366332)]
#Print out of Ofp [(1,) (1,) (1,) ... (1,) (0,) (1,)]

Input train shape (2,)
Input train  type <class 'numpy.ndarray'>
Input test  shape (2,)
Output train  shape (2,)
Ouput train  type <class 'numpy.ndarray'>
Output test  shape (2,)

Input train [0] shape (1423, 6)
Input train [0] type <class 'numpy.ndarray'>
Input test [0] shape (5692, 6)
Output train [0] shape (1423, 2)
Ouput train [0] type <class 'numpy.ndarray'>
Output test [0] shape (5692, 2)

Input train [1] shape (1408, 6)
Input train [1]  type <class 'numpy.ndarray'>
Input test [1]  shape (5634, 6)
Output train [1]  shape (1408, 2)
Ouput train [1]  type <class 'numpy.ndarray'>
Output test [1] shape (5634, 2)


Update: I do not get an error, but Input_Train has the wrong shape, as noted under Step 5 and shown under the second Update.

Issue -->Input train shape (2,) ( I do not want this)

Printing Input_Train[0].shape -->Input train [0] shape (1423, 6)

Printing Input_Train[1].shape -->Input train [1] shape (1408, 6)

What I want is to join these two arrays (Input_Train[0] and Input_Train[1]) so that Input_Train.shape gives (2831, 6)

not Input_Train.shape (2,).
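
The shape difference can be reproduced in isolation. The snippet below is a minimal sketch, assuming zero-filled placeholder arrays with the two shapes printed above: wrapping a list of arrays with different lengths gives an outer array holding 2 objects, while joining along axis 0 stacks the rows.

```python
import numpy as np

# Stand-ins for the two per-file training arrays. The shapes are taken from
# the printout above; the zero values are placeholders.
train_file1 = np.zeros((1423, 6))
train_file2 = np.zeros((1408, 6))
per_file = [train_file1, train_file2]

# Wrapping arrays of different lengths produces an outer array of 2 objects.
# dtype=object is spelled out because recent NumPy versions refuse to build
# such a ragged array implicitly.
wrapped = np.asarray(per_file, dtype=object)
print(wrapped.shape)   # (2,)

# Joining along axis 0 stacks the rows of both files into one 2D array.
combined = np.concatenate(per_file, axis=0)
print(combined.shape)  # (2831, 6)
```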


Solution

  • The solution to my problem is posted below. It was solved with help from @meyere_mit_ai and Minh-Long Luu, thank you!

    A simplified version

    #Define lists for Train and Test for both Input and Output (before the for loop)
    nn_InputFeatures_Train = []
    nn_InputFeatures_Test  = []

    nn_OutputLabels_Train  = []
    nn_OutputLabels_Test   = []

    #The split and the four loops below run inside the for loop over the data files
    InputFeature_train, InputFeature_test, OutputLabel_train, OutputLabel_test = train_test_split( input_features, output_Labels, train_size = 0.2)
    #Append row by row so that the rows of every file end up in one flat list
    for e in InputFeature_train:
            nn_InputFeatures_Train.append(list(e))
    for f in InputFeature_test:
            nn_InputFeatures_Test.append(list(f))
    for g in OutputLabel_train:
            nn_OutputLabels_Train.append(list(g))
    for h in OutputLabel_test:
            nn_OutputLabels_Test.append(list(h))

    #Converting the training and testing lists into arrays (after the for loop)
    nn_InputFeatures_Train = np.asarray(nn_InputFeatures_Train )
    nn_InputFeatures_Test  = np.asarray(nn_InputFeatures_Test  )

    nn_OutputLabels_Train  = np.asarray(nn_OutputLabels_Train  )
    nn_OutputLabels_Test   = np.asarray(nn_OutputLabels_Test   )

    #Now the arrays are properly stacked how I want them. If anyone has a better solution, feel free to comment! :)
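
A more compact variant of the same idea keeps the per-file split but joins the pieces with np.concatenate instead of appending row by row. This is a sketch, not the poster's code: `split_per_file` and its `seed` parameter are hypothetical names, and the random placeholder data only mimics the two file sizes implied by the printed shapes (1423 + 5692 and 1408 + 5634 rows).

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_per_file(features_per_file, labels_per_file, train_size=0.2, seed=0):
    """Split every file separately, then join the pieces along axis 0."""
    x_train, x_test, y_train, y_test = [], [], [], []
    for features, labels in zip(features_per_file, labels_per_file):
        xtr, xte, ytr, yte = train_test_split(
            features, labels, train_size=train_size, random_state=seed
        )
        x_train.append(xtr)
        x_test.append(xte)
        y_train.append(ytr)
        y_test.append(yte)
    # Each list holds one 2D array per file; concatenate stacks their rows.
    return (
        np.concatenate(x_train, axis=0),
        np.concatenate(x_test, axis=0),
        np.concatenate(y_train, axis=0),
        np.concatenate(y_test, axis=0),
    )


# Placeholder data: two "files" with 6 features and one-hot labels per row.
rng = np.random.default_rng(0)
sizes = [7115, 7042]
features_per_file = [rng.normal(size=(n, 6)) for n in sizes]
labels_per_file = [np.eye(2)[rng.integers(0, 2, size=n)] for n in sizes]

X_train, X_test, Y_train, Y_test = split_per_file(features_per_file, labels_per_file)
print(X_train.shape, X_test.shape)  # (2831, 6) (11326, 6)
print(Y_train.shape, Y_test.shape)  # (2831, 2) (11326, 2)
```

Because every file is split on its own, each one contributes 20% of its rows to the training set. The joined arrays are ordered file by file, so shuffling them before training may be worthwhile.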