Updated on 1 Oct 2021

What Is Xavier Initialization?

The 365 Team Published on 30 Aug 2021 8 min read


As humans, we exercise regularly in order to stay in shape. In technical terms, we’re optimizing our bodies by training and gaining muscle mass. Well, the same can be said for machine learning and deep learning models. While you won’t see a deep neural network hitting the gym, it still requires just as much exercise as any regular Joe looking to get into shape.

Training ensures that our models will be optimized to their maximum performance. But, just like you wouldn’t bench 100kg on your first trip to the gym, you wouldn’t expect your model to pull the same weight on its first go either – not without going through the proper procedures first. That is why it’s important not to skip certain steps, such as initialization.

In this tutorial, we’ll introduce you to the concept of initialization, show you exactly why this process is necessary, and explain what the state-of-the-art Xavier initialization is all about.

What Is Initialization?

Without beating around the bush, initialization is a crucial part of machine learning.

As we already mentioned, training is highly important. We have to use the appropriate methods on our models. Using clumsy or inappropriate methods results in poor performance – in such cases, even the fastest computer in the world won’t be able to help. They say that the devil is in the details, and this saying is as true as it gets when talking about initialization.

Initialization is the process of setting the initial values of the weights for our models, neural networks, or other deep learning architectures. You’re probably wondering why that matters: surely these initial weights make no difference, right? Well, it’s quite the opposite actually.

Why Is Initialization Important?

For starters, an inappropriate initialization would lead to an unoptimizable model.

Choosing the right initializer is important to our model’s performance and training. A well-chosen set of initial weights can also shorten the convergence time and help the optimizer reach a lower value of the loss function. As you can see, the benefits are plenty.

But in case you’re still not convinced, let us illustrate the importance of initialization with an example of a model with a single hidden layer:

Xavier Initialization image 1

Let’s initialize our weights and biases in such a way that they are all equal to the same constant (it doesn’t matter which one):

Xavier Initialization image 2

As you can see, the three hidden units are completely symmetrical with respect to the inputs.

Each hidden unit is a function of one weight coming from x1 and one from x2. If all of these weights are equal, there is no reason for the algorithm to learn that h1, h2, and h3 are different. Forward propagating, there is no reason for the algorithm to think that even our outputs are different:

Xavier Initialization image 2

Based on this symmetry, when we’re backpropagating, all the weights are bound to be updated in the same way, without distinguishing between the different nodes in the net. Some optimization would still take place, so the weights won’t stay at their initial value. Still, because every hidden unit receives an identical update, the units remain copies of one another and the extra weights are useless.

Xavier Initialization image 4
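To make the symmetry argument concrete, here is a minimal NumPy sketch. It assumes a network with two inputs, three hidden units and two outputs, a sigmoid on the hidden layer, linear outputs, and a squared-error loss. The sample values, the constant 0.5, the output count, and the loss are our own illustrative choices, not something taken from the figures above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

c = 0.5  # any constant gives the same qualitative result (assumed value)
W1, b1 = np.full((2, 3), c), np.full(3, c)   # inputs -> hidden
W2, b2 = np.full((3, 2), c), np.full(2, c)   # hidden -> outputs

x = np.array([[0.2, -0.7]])   # one made-up sample (x1, x2)
y = np.array([[1.0, 0.0]])    # made-up target

# Forward pass
h = sigmoid(x @ W1 + b1)      # hidden activations h1, h2, h3
y_hat = h @ W2 + b2           # linear outputs

# Backward pass for the loss 0.5 * sum((y_hat - y) ** 2)
d_out = y_hat - y
grad_W2 = h.T @ d_out                      # one row per hidden unit
d_hidden = (d_out @ W2.T) * h * (1.0 - h)
grad_W1 = x.T @ d_hidden                   # one column per hidden unit

print("hidden activations:", h)
print("gradient w.r.t. W1:\n", grad_W1)
print("gradient w.r.t. W2:\n", grad_W2)
```

Every column of `grad_W1` and every row of `grad_W2` comes out the same, so after a gradient step the three hidden units are still exact copies of each other.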

So, how are we supposed to initialize the weights then? We’ll look at a few types of initialization next.

Types of Simple Initializations

Now that we know initialization matters, let’s see how we can deal with it.

Random Initialization

A simple approach would be to initialize weights randomly within a small range. We’ll use the NumPy method: random uniform with a range between minus 0.1 and 0.1.

Xavier Initialization image 5

Even though the initialization with random weights picks the values indiscriminately, they are still chosen in a uniform manner. That means each one has the exact same probability of being chosen – it might sound intuitive, but it is important to stress it.
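As a rough sketch of what this looks like in NumPy, the snippet below draws weights and biases uniformly between -0.1 and 0.1. The layer sizes (2 inputs, 3 hidden units) and the seed are assumptions made only for illustration.

```python
import numpy as np

# Seeded only so that the sketch is repeatable
rng = np.random.default_rng(seed=0)

n_inputs, n_hidden = 2, 3  # illustrative layer sizes (assumed)

# Every value in [-0.1, 0.1) is equally likely
weights = rng.uniform(low=-0.1, high=0.1, size=(n_inputs, n_hidden))
biases = rng.uniform(low=-0.1, high=0.1, size=n_hidden)

print(weights)
print(biases)
```

The legacy call `np.random.uniform(-0.1, 0.1, size=...)` draws from the same distribution. The generator object is simply the style NumPy currently recommends.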

Normal (Naïve) Initialization

Another way we could go about this is by choosing a normal initializer. The idea is basically the same. This time, though, we pick the numbers from a zero-mean normal distribution. The chosen variance is arbitrary, but should be small. As you can guess, since it follows the normal distribution, values closer to 0 are much more likely to be chosen than other values.

An example of such initialization is to draw from a normal distribution with a mean 0 and a standard deviation 0.1:

Xavier Initialization image 6

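Our initial weights and biases will be drawn from a normal distribution with a mean of 0 and a standard deviation of 0.1 (variance 0.01). Unlike the uniform case, the values are not confined to the interval [-0.1, 0.1]: about two-thirds of them fall inside it, and the rest land further from 0.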
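A minimal NumPy sketch of the normal initializer could look like the following. The layer shape and the seed are assumed for illustration, and the larger sample is drawn only to inspect how the values are spread.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded for repeatability

# Weights for an illustrative 2-input, 3-unit layer (assumed sizes)
weights = rng.normal(loc=0.0, scale=0.1, size=(2, 3))

# A larger sample to look at the distribution itself
sample = rng.normal(loc=0.0, scale=0.1, size=100_000)
share_within_one_std = np.mean(np.abs(sample) <= 0.1)

print("sample mean:", sample.mean())
print("sample standard deviation:", sample.std())
print("share of values inside [-0.1, 0.1]:", share_within_one_std)
```

The share printed at the end is the familiar "one standard deviation" proportion of a normal distribution, which is why a normal initializer with standard deviation 0.1 regularly produces weights outside [-0.1, 0.1].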

Although they were the norm until 2010, both methods are somewhat problematic, because the range is chosen arbitrarily and can interact badly with activation functions such as the sigmoid. It was only around 2010 that academics came up with a solution.

Let’s explore what the problem with the sigmoid activation function is. 

What Are the Disadvantages of the Sigmoid Function?

Here is an example that illustrates why sigmoid could be “bad”.

Weights are used in linear combinations which we then activate:

Xavier Initialization image 7

In this case, we use the sigmoid activation function. This function, like other commonly used non-linearities, behaves in a peculiar way around its center (inputs close to 0) and at its extremes:

Xavier Initialization image 8

Activation functions take as inputs the linear combination of the units from the previous layer, right? Well, if the weights are too small, these linear combinations will all fall in a narrow range around 0:

Xavier Initialization image 9

As you can see, in this range the sigmoid is, unfortunately, almost linear. If all our inputs are in this range (which will happen if we use small weights), our chosen function won’t apply a non-linearity to the linear combination as we want it to. Instead, it will behave like a linear function:

Xavier Initialization image 10

Such an outcome is not ideal as non-linearities are essential for deep networks.
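To see how close to linear the sigmoid is near 0, we can compare it with its tangent line at 0, which is 0.5 + z/4. The interval [-0.5, 0.5] below is an arbitrary choice made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Small pre-activations, as produced by very small weights (assumed range)
z = np.linspace(-0.5, 0.5, 101)

# Tangent line of the sigmoid at z = 0: value 0.5, slope 0.25
linear_approx = 0.5 + z / 4.0

max_gap = np.max(np.abs(sigmoid(z) - linear_approx))
print("largest gap between sigmoid and straight line:", max_gap)
```

The gap stays tiny across the whole interval, which is the sense in which a layer fed only small values behaves like a linear transformation.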

If the values are very large and positive or very large and negative, however, the sigmoid is almost flat, meaning the output will be close to 1 or close to 0, respectively:

Xavier Initialization image 11

Such a static output of the activations drives the gradient towards zero while we still haven’t fully trained the algorithm, so learning stalls.
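The effect on the gradient is easy to check numerically. The sketch below evaluates the sigmoid and its derivative, s(z)(1 - s(z)), at a few hand-picked points. The points -10, 0 and 10 are our own illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid with respect to its input
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, 0.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid = {sigmoid(z):.6f}  gradient = {sigmoid_grad(z):.6f}")
```

The derivative is largest at 0 and nearly vanishes at both extremes, so units stuck in the flat regions pass almost no learning signal back to their weights.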

So, what we want here is a range of inputs for the sigmoid that is neither too narrow nor too wide. These inputs depend on the weights, which will have to be initialized in a reasonable range so we can have a nice variance along the linear combinations:

Xavier Initialization image 12

We can achieve this by using a more advanced strategy – the Xavier initialization.

What Is Xavier Initialization?

This method is also known as Glorot initialization. But first, who are Xavier and Glorot?

The truth is, this is actually one person: Xavier Glorot. He is the only academic (that we know of) whose work is named after his first name, rather than his last. Quite intriguing, isn’t it?

Mr. Glorot – or simply Xavier – proposed this method in 2010 and it was quickly adopted on a large scale. This type of initialization is a state-of-the-art technique that we believe anyone interested in neural networks should get properly acquainted with.

So, let’s view an example of how Xavier initialization works.

Example of Xavier Initialization in TensorFlow

We arbitrarily chose the range for the first two cases, right? Well, the Xavier initialization addresses this issue itself.

The main idea is that the method used for randomization isn’t so important. It is the number of inputs and outputs of the layer being initialized that matters. With the passing of each layer, the Xavier initialization keeps the variance within some bounds so that we can take full advantage of the activation functions.
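Here is a small experiment that hints at what "maintaining the variance" means in practice. It pushes random data through ten layers twice: once with a naive normal initializer (standard deviation 0.01) and once with the uniform Xavier formula listed further down in this article. The width of 256, the depth of 10, and the use of tanh (a zero-centered activation that keeps the numbers easy to read) are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

width, depth = 256, 10                 # assumed layer width and depth
x = rng.normal(size=(1000, width))     # made-up input data

def final_activation_std(sample_weights):
    """Spread of the activations after `depth` tanh layers."""
    a = x
    for _ in range(depth):
        W = sample_weights((width, width))
        a = np.tanh(a @ W)
    return a.std()

# Naive choice: small, arbitrary standard deviation
std_small = final_activation_std(
    lambda shape: rng.normal(0.0, 0.01, size=shape)
)

# Uniform Xavier: limit = sqrt(6 / (inputs + outputs))
xavier_limit = np.sqrt(6.0 / (width + width))
std_xavier = final_activation_std(
    lambda shape: rng.uniform(-xavier_limit, xavier_limit, size=shape)
)

print("final activation std, naive normal  :", std_small)
print("final activation std, uniform Xavier:", std_xavier)
```

With the naive weights the signal collapses towards zero long before it reaches the last layer, while the Xavier-initialized network still produces activations with a healthy spread.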

There are two formulas for this strategy.

Uniform Xavier initialization: draw each weight, w, from a random uniform distribution in [-x, x] for $x = \sqrt {\frac {6}{inputs\,+\,outputs}}$

Normal Xavier initialization: draw each weight, w, from a normal distribution with a mean of 0 and a standard deviation $\sigma = \sqrt {\frac {2}{inputs\,+\,outputs}}$
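Although the numerators differ, the two formulas describe the same spread. As a quick check, using the standard fact that a uniform distribution on [-x, x] has variance $x^2/3$:

$$\mathrm{Var}\big(U[-x, x]\big) = \frac{x^2}{3} = \frac{1}{3} \cdot \frac{6}{inputs\,+\,outputs} = \frac{2}{inputs\,+\,outputs} = \sigma^2$$

So both versions give the weights a variance of 2 divided by the sum of inputs and outputs. Only the shape of the distribution changes.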

Uniform Xavier Initialization

The uniform Xavier initialization states we should draw each weight w from a random uniform distribution in the range from minus x to x, where x is equal to the square root of the following fraction: 6 divided by the sum of the number of inputs and the number of outputs for the transformation.

Normal Xavier Initialization

For the normal Xavier initialization, we draw each weight w from a normal distribution with a mean of 0 and a standard deviation equal to the square root of the following fraction: 2 divided by the sum of the number of inputs and the number of outputs for the transformation.

The numerator values 2 and 6 vary across sources, but the main idea is the same.
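Both formulas are short enough to write by hand. The following is a minimal NumPy sketch rather than a reference implementation. The function names `xavier_uniform` and `xavier_normal` and the layer sizes are hypothetical, chosen only for this example.

```python
import numpy as np

def xavier_uniform(n_inputs, n_outputs, rng):
    """Draw weights uniformly from [-x, x], x = sqrt(6 / (inputs + outputs))."""
    limit = np.sqrt(6.0 / (n_inputs + n_outputs))
    return rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

def xavier_normal(n_inputs, n_outputs, rng):
    """Draw weights from N(0, sigma^2), sigma = sqrt(2 / (inputs + outputs))."""
    std = np.sqrt(2.0 / (n_inputs + n_outputs))
    return rng.normal(0.0, std, size=(n_inputs, n_outputs))

rng = np.random.default_rng(seed=0)
n_inputs, n_outputs = 300, 200          # illustrative layer sizes (assumed)

W_uniform = xavier_uniform(n_inputs, n_outputs, rng)
W_normal = xavier_normal(n_inputs, n_outputs, rng)

target_variance = 2.0 / (n_inputs + n_outputs)
print("target variance :", target_variance)
print("uniform variance:", W_uniform.var())
print("normal variance :", W_normal.var())
```

Both samples end up with roughly the same variance, which is what the two numerators are designed to achieve.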

Why Are Inputs and Outputs in Xavier Initialization Important?

Another detail we should highlight here is that the number of inputs and outputs matters.

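The number of inputs matters in the forward pass. Each unit in the next layer sums one weighted term per input before the activation function is applied, so the more inputs there are, the smaller the individual weights must be to keep that sum in a reasonable range:

Xavier Initialization image 14

What about outputs? Well, since we achieve optimization through backpropagation, we have the same problem in the opposite direction: each unit receives one gradient term per output of the transformation, so the number of outputs controls how much the gradients grow or shrink on the way back: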

Xavier Initialization image 15
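For readers who prefer symbols, here is a rough sketch of the usual argument. It assumes zero-mean, independent weights and signals, and an activation that is close to linear around 0. These simplifying assumptions are ours, not something stated in this article. For a linear combination $z_j = \sum_i w_{ij} x_i$, the forward pass gives

$$\mathrm{Var}(z_j) = inputs \cdot \mathrm{Var}(w) \cdot \mathrm{Var}(x)$$

and the backward pass gives

$$\mathrm{Var}\left(\frac{\partial L}{\partial x_i}\right) = outputs \cdot \mathrm{Var}(w) \cdot \mathrm{Var}\left(\frac{\partial L}{\partial z}\right)$$

Keeping the variance unchanged in both directions would require $inputs \cdot \mathrm{Var}(w) = 1$ and $outputs \cdot \mathrm{Var}(w) = 1$ at the same time, which is only possible when the two counts are equal. The Xavier initialization settles on the compromise

$$\mathrm{Var}(w) = \frac{2}{inputs\,+\,outputs}$$

which is the variance behind both the uniform and the normal formula.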

Finally, the uniform Xavier (Glorot) initializer is the default weight initializer for many TensorFlow layers, such as the Keras Dense layer:

Xavier Initialization image 16

So, if you create such a layer without specifying how to initialize its weights, it will automatically adopt the Xavier initializer.
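As a quick, hedged check, the sketch below assumes TensorFlow 2.x with its bundled Keras API. It inspects the default kernel initializer of a `Dense` layer and then requests the uniform Xavier initializer explicitly. The layer size and weight shape are illustrative choices.

```python
import numpy as np
import tensorflow as tf

# A Dense layer created without any initializer argument
layer = tf.keras.layers.Dense(units=3)
print("default kernel initializer:", type(layer.kernel_initializer).__name__)

# Requesting the same initializer explicitly and sampling a 2 x 3 weight matrix
initializer = tf.keras.initializers.GlorotUniform(seed=0)
weights = np.asarray(initializer(shape=(2, 3)))

# Bound from the uniform formula: sqrt(6 / (inputs + outputs))
limit = np.sqrt(6.0 / (2 + 3))
print("weights:\n", weights)
print("limit from the formula:", limit)
```

To use the normal variant instead, you can pass `kernel_initializer=tf.keras.initializers.GlorotNormal()` when creating the layer.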

Xavier Initialization: Next Steps

Deep learning is an evolving practice. It’s important to stay on top of the trends and follow the adopted best practices that will ensure you’re on the right track, moving forward.

As we’ve seen, initialization is a crucial step in training models or neural networks. Moreover, the Xavier initialization is an innovative method that will not only save you time, but also expertly initialize your model’s weights by taking on the brunt of the work. So, it’s more than a good idea to explore it as you go along your machine learning and deep learning journey.

