I am trying to understand this code. I do not understand how if you do:
x_validation, x_test, y_validation, y_test =
train_test_split(x_validation_and_test, y_validation_and_test...
you can later do:
(len(x_validation[y_validation == 0])
Surely the train_test_split means x_validation and y_validation aren't related. What am I missing?
EDIT:
There are some good answers already but I just want to clarify. Are x_validation and y_validation guaranteed to be in the correct order, and the same as each other? Obviously you could add a row to either and mess things up, but is there an underlying index that means order is preserved? I come from a non-Python background, where sometimes you could not guarantee the order of things such as SQL columns.
You absolutely do want x_validation to be related to y_validation, i.e. to correspond to the same rows as you had in your original dataset.
E.g. if validation takes rows 1, 3, 7 from the input x, you would want rows 1, 3, 7 in both x_validation and y_validation.
The idea of the train_test_split function is to divide your dataset into two sets of features (the xs) and the corresponding labels (the ys). So you want, and require,
len(x_validation) == len(y_validation)
and
len(x_test) == len(y_test)
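To check the pairing for yourself, here is a minimal sketch. The toy data, the column name "feature" and the test_size and random_state values are made up for illustration and are not from your code. It assumes the inputs are pandas objects, in which case each output keeps the row labels of the rows it was built from.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, label is 1 when the feature is even.
x = pd.DataFrame({"feature": np.arange(10)})
y = pd.Series((np.arange(10) % 2 == 0).astype(int), name="label")

# Passing x and y in the same call applies the same shuffle to both.
x_validation, x_test, y_validation, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

# The row labels come out in the same order for features and labels.
same_index = x_validation.index.equals(y_validation.index)
lengths_match = (
    len(x_validation) == len(y_validation) and len(x_test) == len(y_test)
)
# Each label still belongs to the feature it was originally paired with.
expected = (x_validation["feature"] % 2 == 0).astype(int)
labels_match = bool((expected == y_validation).all())

print(same_index, lengths_match, labels_match)
```

The rows are shuffled relative to the original data, but because both arrays go through one call, the shuffle is identical for each, which is why they stay in step.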
Looking at other parts of your question that might be causing confusion: y_validation == 0 generates a boolean mask of True and False values that you can use to select only those rows from any data frame with the same length (and, for pandas objects, the same index), so in this case it also works with x_validation.
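A small sketch of the mask in action, using hypothetical hand-written data in place of a real split. The shuffled row labels 7, 1, 3 mimic what a shuffled split would leave behind, and the sketch assumes pandas objects.

```python
import pandas as pd

# Hypothetical validation split with shuffled, non-contiguous row labels.
x_validation = pd.DataFrame({"feature": [10, 30, 70]}, index=[7, 1, 3])
y_validation = pd.Series([0, 1, 0], index=[7, 1, 3])

# Boolean Series carrying the same row labels as y_validation.
mask = y_validation == 0

# Keeps only the rows of x_validation where the mask is True.
negatives = x_validation[mask]

print(mask.tolist())
print(negatives.index.tolist())
```

With pandas the mask is matched to the data frame by row label, so the selection depends on the shared index rather than on position alone.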
As an aside,
len(x_validation[y_validation == 0])
seems a slightly confusing way of counting the number of examples that are of class 0. I would have gone for
(y_validation == 0).sum()
myself, and then you can write the % negative calculation as
100*(y_validation == 0).sum()/len(y_validation)
which is a bit neater to me.
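A quick sketch showing that the two ways of counting agree, again on made-up data rather than your real split:

```python
import pandas as pd

# Hypothetical validation split: three class-0 rows out of four.
x_validation = pd.DataFrame({"feature": [10, 30, 70, 90]}, index=[7, 1, 3, 5])
y_validation = pd.Series([0, 1, 0, 0], index=[7, 1, 3, 5])

# Count class-0 rows by filtering the features, as in the original code.
count_via_x = len(x_validation[y_validation == 0])

# Count class-0 rows directly from the labels (True sums as 1).
count_via_y = (y_validation == 0).sum()

# Percentage of negative (class 0) examples.
pct_negative = 100 * (y_validation == 0).sum() / len(y_validation)

print(count_via_x, count_via_y, pct_negative)  # 3 3 75.0
```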