Does it make sense to mix regularizers? For example, using L1 to select features in the first layer and L2 for the rest?
I created this model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras import regularizers

model = Sequential()
# the input layer uses L1 to partially serve as a feature selection layer
model.add(Dense(10, input_dim=train_x.shape[1], activation='swish', kernel_regularizer=regularizers.l1(0.001)))
model.add(Dense(20, activation='swish', kernel_regularizer=regularizers.l2(0.001)))
model.add(Dense(20, activation='swish', kernel_regularizer=regularizers.l2(0.001)))
model.add(Dense(10, activation='softmax'))
But I'm not sure whether it is a good idea to mix L1 and L2. To me it seems logical to use L1 as a feature selector in the input layer, yet everywhere I look I only see code that uses the same regularizer for all layers.
(The model seems to give quite good results: over 95% correct predictions in a multiclass classification problem.)
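To sanity-check the feature-selection idea, I assume one can inspect the per-feature weight norms of the trained first layer, roughly like this sketch (the threshold value is arbitrary):

import numpy as np

# After training, the kernel of the first Dense layer has shape
# (n_input_features, 10). If L1 really acts as a feature selector,
# whole rows should be driven close to zero.
first_kernel = model.layers[0].get_weights()[0]
feature_norms = np.linalg.norm(first_kernel, axis=1)

# Arbitrary threshold; features below it are effectively ignored
dropped = np.where(feature_norms < 1e-3)[0]
print("features with near-zero outgoing weights:", dropped)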
Adding different regularizations in different layers is not a problem; there are papers on exactly this topic, e.g. sparse-input neural networks. However, a few things need attention here.
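Before getting to those, here is a rough illustration of the sparse-input idea mentioned above. My understanding is that those papers penalize only the input-layer weights, typically with a group-lasso-style term that zeroes out whole input features rather than individual weights; a minimal Keras sketch of that idea (the helper name and strength value are my own, not taken from the papers):

import tensorflow as tf
from tensorflow.keras.layers import Dense

def group_lasso(strength=0.001):
    # Sums the L2 norms of each input feature's row of outgoing weights,
    # pushing entire rows (i.e. whole input features) towards zero
    # rather than individual weights.
    def penalty(weights):
        return strength * tf.reduce_sum(tf.norm(weights, axis=1))
    return penalty

# Drop-in replacement for the plain L1 penalty on your first layer:
model.add(Dense(10, input_dim=train_x.shape[1], activation='swish',
                kernel_regularizer=group_lasso(0.001)))

Keras accepts any callable as a regularizer, although saving and reloading a model with a custom callable like this may require passing it via custom_objects.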