How can you add an NxM matrix to a 1xM matrix?
In your Y = XW + B example at around the 3:55 mark, how can you add an NxM matrix to a 1xM matrix? Would the resulting XW matrix not be NxM in dimension? I suppose on an intuitive level it makes sense that the weights and intercept do not depend on the number of observations, but your example confuses me.
We have addressed this issue in the course notes for this section. If you haven't seen them, they are available for download with the first lecture of the section! Nevertheless, let me elaborate further. You are right that, strictly speaking, we cannot add an NxM matrix and a 1xM matrix in pure linear algebraic terms. What we have done is implicitly extend the 1xM matrix to NxM by stacking its single row N times, and then add it to the NxM matrix in the usual way.
With this in mind, it is nothing but shorthand notation that allows operations such as NxM plus 1xM, as long as the result is unambiguous. In programming terms, this is referred to as broadcasting. From NumPy's documentation on broadcasting:
"Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes." This is exactly what we've done - broadcast 1xM to NxM implicitly, in order to add it to the original NxM matrix!
Now finally, let's see why we do it this way. As you said, it makes sense intuitively that the weights and the biases (intercepts) are independent of the number of observations N. That's why the biases matrix has its first dimension equal to 1 - it is only defined for a single, independent sample. So we could take each sample one by one and compute its output with N=1, and then the equations would work even without broadcasting. However, by doing it the way we do, we're effectively calculating the outputs for all N samples "in parallel". By repeating the same biases N times (stacking them on top of each other), we've achieved this parallelisation - we treat every sample independently by applying the same weights and biases to it.
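To make that equivalence concrete, here is a small sketch comparing the two approaches (the sizes and random data are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 3, 5, 2                 # illustrative sizes: N samples, K inputs, M outputs
X = rng.normal(size=(N, K))
W = rng.normal(size=(K, M))
B = rng.normal(size=(1, M))

# Sample by sample: each 1xK row yields a 1xM output, no broadcasting needed
Y_loop = np.vstack([X[i:i+1] @ W + B for i in range(N)])

# All N samples "in parallel": one matrix product plus a broadcast bias
Y_parallel = X @ W + B

print(np.allclose(Y_loop, Y_parallel))   # True - identical results
```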
A short example will hopefully clear up all uncertainty on the matter. Let's say N=3, M=2, our NxM matrix is XW = [[1,2],[3,4],[5,6]], and B = [-5,10]. Then XW + B is the same as adding [[1,2],[3,4],[5,6]] + [[-5,10],[-5,10],[-5,10]] = [[-4,12],[-2,14],[0,16]]. Naturally, on the computational side, the computer doesn't actually materialise the stacked copies, so do not worry about any computational complexity overhead.
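The same example in NumPy, checking that explicit stacking and implicit broadcasting agree:

```python
import numpy as np

XW = np.array([[1, 2], [3, 4], [5, 6]])   # N=3, M=2
B = np.array([-5, 10])                     # the 1xM biases

# Explicit stacking: repeat B three times, then add in the usual way
stacked = XW + np.tile(B, (3, 1))

# Broadcasting: let NumPy extend B implicitly
broadcast = XW + B

print(broadcast)                           # [[-4 12] [-2 14] [ 0 16]]
print(np.array_equal(stacked, broadcast))  # True
```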