In LSTMs, we usually use the sigmoid function to implement the (soft) gating mechanism, but the problem is that in many cases this function outputs a value around 0.5, which does not mean anything in terms of gating. Why not use binary values (0/1) in an LSTM? What is the basic idea and intuition behind using the sigmoid function in LSTM and GRU?
A binary (hard-step) function in your network would cause problems with backpropagation, since it is not differentiable in a useful way: its derivative is the Dirac delta, which is zero everywhere except at the threshold, so no gradient signal flows through the gate during training. The sigmoid is a smooth approximation of the step function with a well-behaved, nonzero derivative, which is exactly what gradient-based training needs.
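To make the contrast concrete, here is a minimal sketch (my own illustration, not from any LSTM library) comparing the gradient of a sigmoid gate with that of a hard 0/1 step gate:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); strictly positive everywhere
    s = sigmoid(x)
    return s * (1.0 - s)

def step(x):
    # Hard binary gate: outputs exactly 0 or 1
    return (x > 0).astype(float)

def step_grad(x):
    # The step function's derivative is zero almost everywhere
    # (a Dirac delta at x = 0), so backprop receives no signal.
    return np.zeros_like(x)

xs = np.array([-2.0, -0.5, 0.5, 2.0])
print(sigmoid_grad(xs))  # nonzero for every input: gradients can flow
print(step_grad(xs))     # all zeros: the gate cannot be trained by backprop
```

The sigmoid's gradient is small far from zero but never vanishes exactly, so the network can still adjust gate parameters; with a hard step, every weight feeding the gate gets a zero gradient.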