Well, there is no set rule.
A good guideline for the number of hidden layers is the complexity of the data and of the patterns in it. If you have 2 features, more than 2 hidden layers would rarely be useful: there are simply not many different ways in which the features can interact.
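However, with 10 features, the combinations and interactions between them become more pronounced. With 100, even more so (in fact, the number of possible combinations grows exponentially with the number of features). In the same way, the number of observations is also very important: a deeper network has more parameters to fit, so it needs more data to learn them reliably.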
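To get a feel for how quickly the possible interactions grow, here is a minimal sketch that counts them for 2, 10, and 100 features. The helper name `interaction_counts` is made up for illustration, and counting subsets of features is only a rough proxy for "ways the features can interact", not a rule for choosing depth.

```python
from math import comb


def interaction_counts(n_features):
    """Return (pairwise interactions, feature subsets of size >= 2)."""
    # Number of distinct feature pairs: n choose 2.
    pairwise = comb(n_features, 2)
    # All subsets (2**n) minus the empty set and the n single-feature subsets.
    higher_order = 2 ** n_features - n_features - 1
    return pairwise, higher_order


for n in (2, 10, 100):
    pairwise, higher_order = interaction_counts(n)
    print(f"{n} features: {pairwise} pairs, {higher_order} subsets of size >= 2")
```

With 2 features there is a single possible interaction, with 10 there are already 45 pairs and 1013 subsets, and with 100 the subset count is on the order of 2^100.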
Therefore, the more information / patterns / interactions between variables there are, the deeper (and wider) the net you can build.
Please note that in the MNIST and Business case examples that we use, we don’t have enough data to achieve significantly better results after the second hidden layer.
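As a rough illustration of weighing depth against the amount of data, the sketch below counts the trainable parameters of a fully connected network as hidden layers are added. The 784 inputs and 10 outputs match MNIST; the hidden width of 50 and the helper name `count_parameters` are assumptions chosen for illustration, not values taken from the course.

```python
def count_parameters(n_inputs, hidden_sizes, n_outputs):
    """Count weights and biases in a fully connected network."""
    layer_sizes = [n_inputs, *hidden_sizes, n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        # Each layer has a fan_in x fan_out weight matrix plus fan_out biases.
        total += fan_in * fan_out + fan_out
    return total


HIDDEN_WIDTH = 50  # assumed width, for illustration only

for depth in (1, 2, 3):
    n_params = count_parameters(784, [HIDDEN_WIDTH] * depth, 10)
    print(f"{depth} hidden layer(s): {n_params} parameters")
```

Under these assumptions, each extra hidden layer adds 2550 parameters that the same fixed set of observations has to support, which is one way to see why additional layers stop paying off when the data is limited.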