I noticed an interesting behavior. Could you explain further why the model diverges if the loss or errors are not scaled by the number of observations?
Happy to have you here!
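The update rule for w multiplies the learning rate by the dot product of the inputs and the deltas, so the size of each update is a function of (learning rate, inputs, deltas). The dot product sums over all observations, so without rescaling the update grows with the number of observations. With a fixed learning rate, a large enough dataset makes each step overshoot, and the model diverges.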
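To make the update rule concrete, here is a minimal sketch of a single gradient-descent step for a linear model with an L2 loss. Only `deltas`, `deltas_scaled` and `observations` come from the lecture code. The function name `gradient_step`, its parameters and the `scale` flag are assumptions made for illustration.

```python
import numpy as np

def gradient_step(inputs, targets, weights, bias, learning_rate, scale=True):
    """One gradient-descent update for a linear model (illustrative sketch)."""
    observations = inputs.shape[0]

    # Forward pass and errors
    outputs = inputs @ weights + bias
    deltas = outputs - targets

    # Optionally divide by the number of observations, as in the lecture
    deltas_scaled = deltas / observations if scale else deltas

    # The update depends on the learning rate, the inputs and the deltas
    weights = weights - learning_rate * (inputs.T @ deltas_scaled)
    bias = bias - learning_rate * np.sum(deltas_scaled)
    return weights, bias
```

With `scale=False` the same learning rate produces an update that is `observations` times larger, which is why the choice of learning rate becomes tied to the size of the dataset.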
Rescaling is usually done for faster/easier learning (or modularization).
1. We can rescale the inputs. That’s something we do later on in the course, but not in this lecture.
2. We can change the learning rate. That’s something we play with in the exercises.
3. We can rescale the deltas (that’s what we do in the lecture).
Often we combine all three. In this course, though, we are showing you things one at a time.
Rescaling in this case was done to make the choice of the learning rate and the number of iterations easier.
This rescaling trick gives us the “deltas per observation”, so the size of the update no longer grows with the number of points. Thus the learning rate we will use for 10 or 10,000 points will be the same. That’s a very useful and important property.
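A quick way to see this property is to compare the size of the weight update direction with and without rescaling for two dataset sizes. The sketch below is an assumption-laden illustration, not the lecture code: the function name `weight_gradient_norms`, the random inputs in [-10, 10] and the target coefficients are all made up for the demonstration.

```python
import numpy as np

def weight_gradient_norms(observations, seed=0):
    """Return the norm of the weight gradient without and with rescaling."""
    rng = np.random.default_rng(seed)

    # Hypothetical data: two input variables and a linear target
    inputs = rng.uniform(-10, 10, size=(observations, 2))
    targets = inputs @ np.array([[2.0], [-3.0]]) + 5.0

    # Start from zero weights and zero bias
    weights = np.zeros((2, 1))
    deltas = inputs @ weights - targets

    raw = np.linalg.norm(inputs.T @ deltas)
    scaled = np.linalg.norm(inputs.T @ (deltas / observations))
    return raw, scaled

for n in (10, 10_000):
    raw, scaled = weight_gradient_norms(n)
    print(f"n={n}: unscaled norm={raw:.1f}, scaled norm={scaled:.1f}")
```

The unscaled norm is exactly `observations` times the scaled one, so it keeps growing as you add data, while the scaled norm stays in a similar range. That is why one learning rate can serve both dataset sizes once the deltas are rescaled.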
In order to fully understand the issue, the best thing for you to do would be the following:
You already know the values of the weights and biases you are aiming for (and in fact you can print them at each iteration to see how the training goes). You also have the value of the loss, so that’s another point of reference (the more fundamental one). So:
1. Don’t rescale the deltas. Leave them as they are.
The easiest way to do that would be to change:
deltas_scaled = deltas / observations
to:
deltas_scaled = deltas
The rest of the code will then remain unchanged.
2. See if the algorithm converges.
3. Change the learning rate.
4. Repeat 2 and 3 until you find a satisfactory result (you’ll know when).
5. If the algorithm converges, find how many iterations it needs to converge.
6. Repeat until you find some optimal values.
Then compare that whole experience with the solution in the lecture – rescaling the deltas.
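If you want a starting point for that experiment, the steps above could be sketched roughly as follows. This is not the lecture notebook: the function name `train`, the generated data, the default of 1,000 observations and 100 iterations, and the divergence threshold of 1e12 are all assumptions chosen for illustration. Only `deltas`, `deltas_scaled` and `observations` follow the lecture's naming.

```python
import numpy as np

def train(learning_rate, scale, observations=1000, iterations=100, seed=0):
    """Train a linear model and return the last loss, or inf if it diverged."""
    rng = np.random.default_rng(seed)

    # Hypothetical data: targets = 2*x - 3*z + 5 + noise
    inputs = rng.uniform(-10, 10, size=(observations, 2))
    noise = rng.uniform(-1, 1, size=(observations, 1))
    targets = inputs @ np.array([[2.0], [-3.0]]) + 5.0 + noise

    # Small random starting values
    weights = rng.uniform(-0.1, 0.1, size=(2, 1))
    bias = rng.uniform(-0.1, 0.1)

    loss = np.inf
    for _ in range(iterations):
        deltas = inputs @ weights + bias - targets
        loss = np.sum(deltas ** 2) / 2 / observations

        # Arbitrary cut-off: treat an exploding loss as divergence
        if not np.isfinite(loss) or loss > 1e12:
            return np.inf

        # Step 1 of the experiment: switch the rescaling on or off
        deltas_scaled = deltas / observations if scale else deltas
        weights = weights - learning_rate * (inputs.T @ deltas_scaled)
        bias = bias - learning_rate * np.sum(deltas_scaled)
    return loss

# Steps 2 to 6: try several learning rates with and without rescaling
for lr in (0.02, 0.002, 0.0002, 0.00002):
    print(f"lr={lr}: scaled loss={train(lr, True)}, unscaled loss={train(lr, False)}")
```

One thing to look for: an unscaled run with learning rate `lr / observations` takes the same steps as a scaled run with learning rate `lr`. So without rescaling, the learning rate you find will only be valid for that particular number of observations.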
The best way to learn machine learning is to play around with the algorithm.
The 365 Team