python dataframe numpy indexing numpy-slicing

What is the Numpy slicing notation in this code?

# split into inputs and outputs

X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)

Can someone explain the second line of code with reference to specific documentation? I know it's slicing, but I couldn't find any reference for the notation ":-1" anywhere. Please point to the specific portion of the documentation.

Thank you

The code slices the data, most probably with NumPy, and it is applied to data of shape (610, 14).

Solution

Per the docs:

Indexing on `ndarrays`

ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj the selection. There are different kinds of indexing available depending on obj: basic indexing, advanced indexing and field access.
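To connect the quoted passage to the ":-1" notation: the colon syntax is ordinary Python slice syntax, and ":-1" is shorthand for slice(None, -1), i.e. start at the beginning and stop before the last item. The sketch below uses a small made-up array (the name data and its shape are assumptions for illustration, not the asker's actual data).

```python
import numpy as np

# Hypothetical stand-in for the question's data: 4 rows, 3 columns
data = np.arange(12).reshape(4, 3)

# Colon notation: all rows, every column except the last
X_colon = data[:, :-1]

# The same selection written with explicit slice objects
X_slice = data[slice(None), slice(None, -1)]

same = np.array_equal(X_colon, X_slice)
print(same)           # True
print(X_colon.shape)  # (4, 2)
```

Both forms are basic indexing, so the result is a view onto the original array rather than a copy.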

1D array

Slicing a 1-dimensional array is much like slicing a list.

import numpy as np


np.random.seed(0)
array_1d = np.random.random((5,))

print(len(array_1d.shape))

NOTE: The len of the array shape tells you the number of dimensions (the same value as array_1d.ndim).

We can use standard Python list slicing on the 1D array.

# get the last element
print(array_1d[-1])

0.4236547993389047

# get everything up to but excluding the last element
print(array_1d[:-1])

[0.5488135  0.71518937 0.60276338 0.54488318]

2D array

array_2d = np.random.random((5, 1))

print(len(array_2d.shape))

Think of a 2-dimensional array like a data frame. It has rows (the 0th axis) and columns (the 1st axis). NumPy lets us slice these axes independently by separating the per-axis selections with a comma (,).

# the 0th row and all columns
print(array_2d[0, :])

[0.79172504]

# the 1st row and everything after + all columns
print(array_2d[1:, :])

[[0.52889492]
 [0.56804456]
 [0.92559664]
 [0.07103606]]

# the 1st through second to last row + the last column
print(array_2d[1:-1, -1])

[0.52889492 0.56804456 0.92559664]

Your Example

# split into inputs and outputs

X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)

Note that data must have at least two dimensions, i.e. len(data.shape) >= 2 (otherwise you'd get an IndexError).
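A quick way to see the IndexError mentioned above is to apply the same two-axis index to a 1-dimensional array. This is only a sketch; the array flat is a made-up example.

```python
import numpy as np

# Hypothetical 1-D array: it has only one axis to index
flat = np.arange(5)

try:
    flat[:, :-1]  # two selections for a one-axis array
    raised = False
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

print(raised)  # True
```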

This means data[:, :-1] is keeping all "rows" and slicing up to, but not including, the last "column". Likewise, data[:, -1] is keeping all "rows" and selecting only the last "column".
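As a rough check of that reading, the snippet below builds a placeholder array with the shape given in the question, (610, 14), and applies the same two slices. The zeros are only a stand-in; the real values of data are unknown.

```python
import numpy as np

# Placeholder with the shape from the question; the contents are made up
data = np.zeros((610, 14))

# All rows, every column but the last -> inputs
# All rows, only the last column      -> outputs
X, y = data[:, :-1], data[:, -1]

print(X.shape, y.shape)  # (610, 13) (610,)
```

X keeps both axes because both selections are slices, while y drops the column axis because -1 is a single integer index.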

It's important to know that when you slice an ndarray using a colon (:), you get an array with the same number of dimensions (the size along each axis may change, but no axis is dropped).

print(len(array_2d[1:, :-1].shape))  # 2

But if you "select" a specific index (i.e. don't use a colon), you reduce the number of dimensions: each single integer index removes its axis.

print(len(array_2d[1, :-1].shape))  # 1, because I selected a single index value on the 0th axis

print(len(array_2d[1, -1].shape))  # 0, because I selected a single index value on both the 0th and 1st axes
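If you want the last column but need to keep it 2-dimensional (for example, as a column vector), one option is to use the slice -1: instead of the integer -1. The sketch below uses a made-up 4 x 3 array to compare the two.

```python
import numpy as np

# Hypothetical 4 x 3 array
data = np.arange(12).reshape(4, 3)

y_1d = data[:, -1]   # integer index: the column axis is dropped
y_2d = data[:, -1:]  # slice: the column axis is kept with length 1

print(y_1d.shape)  # (4,)
print(y_2d.shape)  # (4, 1)
```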

You can, however, select a list of indices on either axis (assuming they exist).

print(len(array_2d[[1], [-1]].shape))  # 1

print(len(array_2d[[1, 3], :].shape))  # 2
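The two results above differ because of how NumPy treats lists. A list on one axis combined with a slice on the other keeps both axes, while lists on both axes are paired element by element and give a 1-dimensional result. A small sketch with a made-up array:

```python
import numpy as np

# Hypothetical 4 x 3 array holding the values 0..11
a = np.arange(12).reshape(4, 3)

# List on the row axis, slice on the column axis: rows 0 and 2, all columns
rows = a[[0, 2], :]

# Lists on both axes are paired: picks a[0, 0] and a[2, 1]
pairs = a[[0, 2], [0, 1]]

print(rows.shape)  # (2, 3)
print(pairs)       # [0 7]
```

Unlike plain slices, indexing with lists returns a copy rather than a view of the original array.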
