In performing linear algebra operations, how does deep learning vary the number of neurons in each layer while maintaining correct algebra operations?
This code generates matrices of various dimensions and calculates their dot product; I've attempted to concentrate on just the core operation of moving values through a network.
import numpy as np

def gm(m, n):
    # low == high == 1, so every entry is exactly 1.0,
    # which keeps the dot products easy to trace by hand
    return np.random.uniform(1, 1, (m, n))

x = gm(4, 2)        # 4 x 2
print('x', x)
m1 = gm(2, 3)       # 2 x 3
print('m1', m1)
d1 = np.dot(x, m1)  # (4 x 2) . (2 x 3) -> 4 x 3
print('d1', d1)
prints:
x [[1. 1.]
[1. 1.]
[1. 1.]
[1. 1.]]
m1 [[1. 1. 1.]
[1. 1. 1.]]
d1 [[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]
[2. 2. 2.]]
If the desired output matrix is 2x1, how is this produced? In other words, how do you produce a 2x1 matrix from matrix d1?
Frameworks such as Keras abstract this, but how does the abstraction take place?
The mapping is relatively simple, and I will focus purely on MLPs (other neural networks with more complex structure do many other things, but the core idea is the same).
Say your input is a batch of size [B x D], where B is the number of samples (which can be 1) and D is the number of features (input dimensions). As the output you want [B x K], where K is the number of outputs.
A typical MLP is just a sequence of affine transformations, each followed by some (point-wise) nonlinearity f:
h_{i+1} = f( h_i W_i + B_i)
where h_0 = X (inputs), and h_N is the output
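In numpy terms, one step of that recursion is a matrix multiply, a bias addition that is broadcast across the rows, and a point-wise function. A minimal sketch of that single step, with tanh standing in for f purely as an example:
import numpy as np

def layer(h, W, b, f=np.tanh):
    # h: [B x D_in], W: [D_in x D_out], b: [1 x D_out]
    # returns f(h W + b), shape [B x D_out]; b is added to every row
    return f(np.dot(h, W) + b)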
Say you want one hidden layer with Z neurons. Then all you have to do is create two matrices W and two vectors B: one pair maps the inputs to Z dimensions, and the other maps from Z to the desired K:
W_1 is D x Z
B_1 is 1 x Z
W_2 is Z x K
B_2 is 1 x K
Consequently we have
f(X W_1 + B_1) W_2 + B_2
X W_1 is [B x D] [D x Z] = [B x Z]
X W_1 + B_1 is [B x Z] + [1 x Z] = [B x Z] # note that summation
# is overloaded in the sense that it is adding the
# same B to each row of the argument
f(X W_1 + B_1) is [B x Z]
f(X W_1 + B_1) W_2 is [B x Z] [Z x K] = [B x K]
f(X W_1 + B_1) W_2 + B_2 is [B x K] + [1 x K] = [B x K]
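To make those shapes concrete, here is a sketch of the full two-layer computation in numpy, reusing the question's gm helper and picking B = 4, D = 2, Z = 3, K = 1 arbitrarily (tanh stands in for f):
import numpy as np

def gm(m, n):
    return np.random.uniform(1, 1, (m, n))

B, D, Z, K = 4, 2, 3, 1
X = gm(B, D)                       # inputs,      [B x D]
W1, B1 = gm(D, Z), gm(1, Z)        # first pair:  maps D -> Z
W2, B2 = gm(Z, K), gm(1, K)        # second pair: maps Z -> K

h = np.tanh(np.dot(X, W1) + B1)    # [B x Z]; B1 is added to every row
y = np.dot(h, W2) + B2             # [B x K]
print(y.shape)                     # (4, 1)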
Analogously, with more hidden layers you just keep matrix-multiplying on the right by a matrix of size [dimension_of_previous_one x desired_output_dimension], which is just the regular linear projection operation from mathematics (and the biases make it affine).
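With more layers the same pattern simply repeats. Continuing the sketch above (same gm helper and numpy import, tanh again as the nonlinearity, layer sizes chosen arbitrarily), you can stack as many layers as you like in a loop:
sizes = [2, 3, 5, 1]                   # D, two hidden sizes, K -- arbitrary example
Ws = [gm(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
Bs = [gm(1, b) for b in sizes[1:]]

h = gm(4, sizes[0])                    # a batch of 4 inputs, [4 x D]
for W, Bi in zip(Ws[:-1], Bs[:-1]):
    h = np.tanh(np.dot(h, W) + Bi)     # hidden layers: affine + nonlinearity
y = np.dot(h, Ws[-1]) + Bs[-1]         # output layer kept affine here
print(y.shape)                         # (4, 1)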