We have done it in that way as it is the programming way to approach the problem (see broadcasting below).
We have addressed this issue in the course notes on the section. If you havent seen them, they are available for download with the lecture!
Nevertheless, let me elaborate further.
You are right that technically we cannot add an NxM matrix and a 1xM matrix in pure linear algebraic terms. What we have done is that we have implicitly extended the 1xM matrix to NxM by stacking its only row N times on top of each other, and then added it in the usual way to the NxM one. Having this in mind, it’s nothing but shorthand notation to allow operations such as NxM plus 1xM, as long as it is not ambiguous.
In programming terms, this is referred to as broadcasting. Taken from NumPy’s page on broadcasting: “Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.” This is exactly what we’ve done – broadcast 1xM to NxM implicitly, in order to add it to the original NxM matrix!
Now finally, let’s see why we do it this way. As you said, it makes sense intuitively that the weights and the biases (intercepts) are independent of the number of observations N. That’s why the biases matrix has its first dimension equal to 1 – it is only defined for a single, independent sample. So we could take each sample one by one, and compute their outputs with N=1, and then the equations would work even without broadcasting. However, in the way we’re doing it, we’re effectively calculating all outputs for the N samples “in parallel”. By repeating the same biases N times (stacking them on top of each other), we’ve achieved this parallelisation – we treat every sample independently by applying the same weights and bias to it.
A short example will hopefully clear all uncertainty on the matter. Let’s say N=3, M=2, our NxM matrix is XW = [[1,2],[3,4],[5,6]], and B = [-5,10]. Then XW + B is the same as adding [[1,2],[3,4],[5,6]] + [[-5,10],[-5,10],[-5,10]] = [[-4,12],[-2,14],[0,16]]. Naturally, on the computational side, the computer doesn’t need to go through this extra implicit step, so do not worry about computational complexity overhead.
I hope this clarifies things a bit, and feel free to ask any other questions you might have – they are the key to proper understanding!
The 365 Team