Tags: python, machine-learning, keras, keras-layer, mnist

Keras Sequential Dense input layer - and MNIST: why do images need to be reshaped?


I'm asking this because I feel I'm missing something fundamental.

By now almost everyone knows that the MNIST images are 28x28 pixels. The Keras documentation tells me this about Dense:

Input shape nD tensor with shape: (batch_size, ..., input_dim). The most common situation would be a 2D input with shape (batch_size, input_dim).

So a newbie like me would assume that the images could be fed to the model as a 28x28 matrix. Yet every tutorial I found goes through various gymnastics to convert the images into a single 784-element feature vector.

Sometimes by

num_pixels = X_train.shape[1] * X_train.shape[2]
model.add(Dense(num_pixels, input_dim=num_pixels, activation='...'))

or

num_pixels = np.prod(X_train.shape[1:])
model.add(Dense(512, activation='...', input_shape=(num_pixels,)))

or

model.add(Dense(units=10, input_dim=28*28, activation='...'))
history = model.fit(X_train.reshape((-1,28*28)), ...)

or even:

model = Sequential([Dense(32, input_shape=(784,)), ...])

So my question is simply: why? Can't Dense just accept an image as-is or, if necessary, process it "behind the scenes", as it were? And if, as I suspect, this processing has to be done, is any of these methods (or others) inherently preferable?
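
(For reference, here is a minimal end-to-end sketch of the flatten-then-Dense pattern that all of the snippets above share; the layer sizes, loss, and epoch count are arbitrary choices of mine, not taken from any particular tutorial.)

import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Flatten each 28x28 image into a 784-element vector and scale to [0, 1]
num_pixels = np.prod(X_train.shape[1:])              # 28 * 28 = 784
X_train = X_train.reshape((-1, num_pixels)) / 255.0
X_test = X_test.reshape((-1, num_pixels)) / 255.0

model = Sequential([
    Dense(512, activation='relu', input_shape=(num_pixels,)),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=2, validation_data=(X_test, y_test))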


Solution

  • As requested by the OP (i.e. the Original Poster), I will repeat the answer I gave in my comment and elaborate on it.

    Can't Dense just accept an image as-is or, if necessary, just process it "behind the scenes", as it were?

    Simply, no! That's because the Dense layer is applied on the last axis of its input. Therefore, if you feed it an image of shape (height, width) or (height, width, channels), the Dense layer would be applied only along the last axis (i.e. width or channels). However, when the image is flattened, all the units in the Dense layer are applied to the whole image, and each unit is connected to all the pixels with its own weights. To further clarify this, consider this model:

    from keras import models, layers

    model = models.Sequential()
    model.add(layers.Dense(10, input_shape=(28*28,)))
    model.summary()
    

    Model summary:

    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_2 (Dense)              (None, 10)                7850      
    =================================================================
    Total params: 7,850
    Trainable params: 7,850
    Non-trainable params: 0
    _________________________________________________________________
    

    As you can see, there are 7850 parameters in the Dense layer: each unit is connected to all the pixels (28*28*10 + 10 bias params = 7850). Now consider this model:

    model = models.Sequential()
    model.add(layers.Dense(10, input_shape=(28,28)))
    model.summary()
    

    Model summary:

    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    dense_3 (Dense)              (None, 28, 10)            290       
    =================================================================
    Total params: 290
    Trainable params: 290
    Non-trainable params: 0
    _________________________________________________________________
    

    In this case there are only 290 parameters in the Dense layer. Each unit is still applied to every pixel, but the difference is that the weights are shared across the first axis: the same weight matrix is applied to each of the 28 rows (28*10 + 10 bias params = 290). It is as though features are extracted from each row of the image separately, whereas the previous model extracted features from the whole image at once. This weight sharing may or may not be useful for your application.
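
    As a quick sanity check, a small sketch like the following (not from the original discussion) inspects the weight shapes directly: the flattened model stores a single (784, 10) weight matrix, while the (28, 28) model stores a (28, 10) matrix that is re-applied to every row of the image:

    import numpy as np
    from keras import models, layers

    flat_model = models.Sequential()
    flat_model.add(layers.Dense(10, input_shape=(28*28,)))

    row_model = models.Sequential()
    row_model.add(layers.Dense(10, input_shape=(28, 28)))

    # Flattened input: a single (784, 10) kernel + 10 biases = 7850 parameters
    print(flat_model.layers[0].get_weights()[0].shape)   # (784, 10)

    # 2D input: a single (28, 10) kernel + 10 biases = 290 parameters,
    # re-applied to each of the 28 rows
    print(row_model.layers[0].get_weights()[0].shape)    # (28, 10)

    img = np.random.rand(1, 28, 28).astype('float32')
    print(row_model.predict(img).shape)                  # (1, 28, 10): one 10-vector per row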