Does scikit-learn train_test_split preserve relationships?

I am trying to understand this code. I do not understand how if you do:

x_validation, x_test, y_validation, y_test = 
  train_test_split(x_validation_and_test, y_validation_and_test...

you can later do:

len(x_validation[y_validation == 0])

Surely the train_test_split means x_validation and y_validation aren't related. What am I missing?

EDIT: There are some good answers already, but I just want to clarify: are x_validation and y_validation guaranteed to be in the correct order, and in the same order as each other? Obviously you could add a row to either and mess things up, but is there an underlying index that means order is preserved? I come from a non-Python background, where sometimes you could not guarantee the order of things like SQL columns.

Solution

You absolutely do want x_validation to be related to y_validation, i.e. to correspond to the same rows as in your original dataset. For example, if the validation set takes rows 1, 3, 7 from the input x, you want rows 1, 3, 7 in both x_validation and y_validation.

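The idea of the train_test_split function is to divide your dataset into two sets of features (the xs) and the corresponding labels (the ys). It applies the same row selection to every array you pass in a single call, so features and labels stay paired and in the same order as each other. So you want and require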

len(x_validation) == len(y_validation)

and

len(x_test) == len(y_test)
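To check the pairing yourself, here is a minimal sketch. The toy arrays, test_size=0.5 and random_state=0 are my own choices for illustration and are not taken from the code in the question. Each row of x stores its original row number, so you can verify after the split that every label still belongs to its row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): row i of x is [i, i] and its label is i % 2,
# so the correct label can be recomputed from the row itself after the split.
x = np.column_stack([np.arange(10), np.arange(10)])
y = np.arange(10) % 2

x_validation, x_test, y_validation, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

# The first column of each row is its original row number; the label was
# defined as row number % 2, so the two must agree if the pairing survived.
pairs_match = bool(
    np.array_equal(x_validation[:, 0] % 2, y_validation)
    and np.array_equal(x_test[:, 0] % 2, y_test)
)
print(pairs_match)
```

The rows come back shuffled relative to the input, but x and y are shuffled identically, which is what makes the later mask lookup valid.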

Looking at other parts of your question that might be causing confusion:

y_validation == 0

will generate a boolean mask of True and False values that you can use to select only those rows from any data frame or array with the same length, so in this case it will also work with x_validation.
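A small sketch of the mask in action, using made-up NumPy arrays in place of the real x_validation and y_validation (the values are purely illustrative):

```python
import numpy as np

# Made-up stand-ins for the real validation arrays.
y_validation = np.array([0, 1, 0, 0, 1])
x_validation = np.array([[10, 11], [20, 21], [30, 31], [40, 41], [50, 51]])

# Element-wise comparison gives one boolean per label.
mask = y_validation == 0

# Indexing with the mask keeps only the rows where the mask is True,
# here rows 0, 2 and 3.
negatives = x_validation[mask]
print(mask)
print(negatives)
```

This only gives the right rows because position i in y_validation describes row i in x_validation, which is the pairing the split preserves.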

As an aside,

len(x_validation[y_validation == 0])

seems a slightly confusing way of counting the number of examples that are of class 0. I would have gone for

(y_validation == 0).sum()

myself and then you can write the % negative calculation as

100*(y_validation == 0).sum()/len(y_validation)

which is a bit neater to me.
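If you want to convince yourself that the two ways of counting agree, here is a quick sketch with made-up arrays (the values are illustrative, not from the question):

```python
import numpy as np

# Made-up stand-ins for the real validation arrays.
y_validation = np.array([0, 1, 0, 0, 1])
x_validation = np.arange(10).reshape(5, 2)

# Original approach: select the class-0 rows of x, then count them.
count_via_mask = len(x_validation[y_validation == 0])

# Shorter approach: True counts as 1, so summing the mask counts class 0.
count_via_sum = int((y_validation == 0).sum())

percent_negative = 100 * count_via_sum / len(y_validation)
print(count_via_mask, count_via_sum, percent_negative)
```

The sum version never touches x_validation at all, so it avoids building a temporary copy of the selected rows.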