In what sense is the softmax incorporated into the loss function?
Good morning. Here we have sparse categorical cross entropy as the loss function. I think this means that the loss is calculated as -sum log(y_i_pred) * y_i_true, or something similar. So where do we put the softmax here? Maybe the new loss function becomes -sum softmax(log(y_i_pred) * y_i_true)?
Furthermore, I haven't understood what the problem is if we use the softmax as the activation function for all the output neurons as usual; what's the difference? Thank you
Hi,
Yes, in effect we use softmax(y_i_pred) instead of y_i_pred; however, the whole expression is then expanded and simplified.
In most situations, there won't be any problem with placing the softmax in the last layer as usual. Sometimes, however, the computations may become unstable due to rounding and other floating-point inaccuracies.
According to the folks at TensorFlow, if we incorporate the softmax directly into the loss function, the calculations are easier for the computer and there are fewer errors due to rounding; thus, the training is more stable.
Theoretically, there is no difference between the two approaches; it's just easier for the computer, which is why I advised incorporating the softmax into the loss.
In any case, the potential problems are very rare, so there's not much difference.
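For reference, here is a minimal sketch (assuming TensorFlow/Keras; the layer sizes are only illustrative) of the two setups, so you can see where the softmax goes in each case:

```python
import tensorflow as tf

# Option 1: softmax as the activation of the last layer,
# so the loss receives probabilities directly.
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(3, activation='relu'),
    tf.keras.layers.Dense(8, activation='softmax'),
])
model_a.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
)

# Option 2: no activation in the last layer (the outputs are raw "logits"),
# and the softmax is incorporated into the loss via from_logits=True.
model_b = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(3, activation='relu'),
    tf.keras.layers.Dense(8),
])
model_b.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

Both describe the same classifier in theory; the second version just lets the loss handle the softmax internally, which is the numerically safer route the TensorFlow folks recommend.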
Best,
Nikola, 365 Team
Thank you for your answer. But it's still not very clear to me what the new loss function is. Let's suppose we have 2 inputs a_1 and a_2 (that is, 2 features), 1 hidden layer with 3 neurons b_1, b_2 and b_3, and 8 output neurons y_1 to y_8. Usually we have y_1 = softmax(w_11 b_1 + w_21 b_2 + w_31 b_3), that is, e^(w_11 b_1 + w_21 b_2 + w_31 b_3) / (e^(w_11 b_1 + w_21 b_2 + w_31 b_3) + ... + e^(w_18 b_1 + w_28 b_2 + w_38 b_3)), and the analogous thing for y_2, y_3, ..., y_8 (in general, y_i = e^(w_1i b_1 + w_2i b_2 + w_3i b_3) / (e^(w_11 b_1 + w_21 b_2 + w_31 b_3) + ... + e^(w_18 b_1 + w_28 b_2 + w_38 b_3))). Now, with sparse categorical cross entropy the loss function is calculated as f = - sum log(y_i) * y_i_true = - (log(y_1) * y_1_true + ... + log(y_8) * y_8_true).
With the softmax incorporated into the loss function, instead, what is the loss function? I expect that here it's just y_i = w_1i b_1 + w_2i b_2 + w_3i b_3 (no activation), while the loss is maybe calculated as f = - sum log(softmax(y_i)) * y_i_true? Or maybe f = - sum softmax(log(y_i)) * y_i_true? Or something else? Thank you
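To make my first guess concrete, here is a small numerical sketch (assuming TensorFlow/Keras; the logits and the label are made-up numbers for illustration) of what I would expect the loss with the softmax incorporated to compute:

```python
import numpy as np
import tensorflow as tf

# Made-up raw outputs y_1..y_8 for a single example (no activation applied)
# and the index of the true class (sparse label).
logits = np.array([[2.0, -1.0, 0.5, 0.0, 1.0, -0.5, 0.3, -2.0]])
true_class = np.array([0])

# My first guess: apply the softmax to the raw outputs, then take
# -log of the probability assigned to the true class.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
manual = -np.log(probs[0, true_class[0]])

# The built-in loss with the softmax incorporated (from_logits=True).
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
built_in = loss_fn(true_class, logits).numpy()

# If the first guess is right, these two values should agree
# (up to floating-point error).
print(manual, built_in)
```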
Is anyone able to answer my previous question? Thank you