# Weight Initialization in PyTorch

PyTorch is a powerful deep learning library from Facebook that makes it easy to implement neural networks with GPU acceleration. PyTorch builds its computation graphs dynamically, so the graph can change from one forward pass to the next during training. Because of this dynamic graph construction, the library also has some of the best tracebacks of all the deep learning libraries: it is actually possible to figure out what went wrong, unlike in some other libraries. One of the best features of PyTorch is its smooth integration with Python, so libraries such as pandas and NumPy can be used alongside it.

## All About PyTorch Weight Initialization

**Importance of weights in neural networks**

Perhaps the most important aspect of a neural network is the set of weights present in the model after training. The weights are almost the entirety of the model. Companies spend billions of dollars determining the weights of complicated neural networks so that predictions can be made.

Training sweeps over the training data for thousands of iterations to find the weights at each layer that best relate the inputs to the output target variable. To start training, though, we need some initial value for every weight in the network. From there, the algorithm adjusts the weights based on the training data: backpropagation computes the gradients, and stochastic gradient descent, which I covered in another post, uses them to update the weights.
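As a rough illustration of that loop, here is a minimal sketch of a single training step. The tiny model, the random data, and the learning rate are all made up for the example; the point is only that the weights start at some initial value and are then moved by one gradient step.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy setup: 3 inputs, 1 output, random data.
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(16, 3)
y = torch.randn(16, 1)

# The weights already hold their initial values before any training.
before = model.weight.detach().clone()

loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()   # backpropagation: compute gradients of the loss
optimizer.step()  # gradient descent: nudge the weights along the gradients

after = model.weight.detach().clone()
weights_changed = not torch.equal(before, after)
print(weights_changed)
```

Real training repeats this step many times over mini-batches of the data, which is where the "stochastic" in stochastic gradient descent comes from.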

**Weight Initialization**

There are several ways to initialize weights in general. After discussing them, I will show how this is done in PyTorch specifically. So, how can weights be initialized in neural networks? There are three main ways:

**Random initialization**

In this approach, the weights are chosen completely at random. This certainly keeps things simple, but it can cause serious problems with the gradients if a weight that is too large or too small lands on the wrong connection.

The gradient can explode, which means that the slope becomes far too steep and throws off everything else in the network.

Similarly, the gradient can vanish if the weights are initialized with values that are too small. A common mitigation is to use the Rectified Linear Unit (ReLU) as the activation function, which helps mainly with vanishing gradients. ReLU maps negative inputs to zero and passes positive inputs through unchanged, so its slope is exactly 1 for positive inputs and the gradient is not shrunk on its way through.
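To make both failure modes concrete, here is a small sketch that measures the gradient reaching the first layer of a deep network for two different weight scales. It rests on several simplifying assumptions: the network is a stack of 10 bias-free linear layers with no activation, the weights are drawn from a normal distribution, and `first_layer_grad_norm` is a helper name invented for this example. The last lines show what ReLU does to a handful of values.

```python
import torch
from torch import nn

def first_layer_grad_norm(weight_std, depth=10, width=128, seed=0):
    """Norm of the gradient at the first layer of a deep linear stack
    whose weights are drawn from N(0, weight_std^2)."""
    torch.manual_seed(seed)
    layers = [nn.Linear(width, width, bias=False) for _ in range(depth)]
    for layer in layers:
        nn.init.normal_(layer.weight, mean=0.0, std=weight_std)
    model = nn.Sequential(*layers)

    x = torch.randn(64, width)
    model(x).sum().backward()  # any scalar loss works for this demo
    return layers[0].weight.grad.norm().item()

exploding = first_layer_grad_norm(weight_std=1.0)   # weights too large
vanishing = first_layer_grad_norm(weight_std=0.01)  # weights too small
print(f"std=1.0  -> first-layer gradient norm {exploding:.3e}")
print(f"std=0.01 -> first-layer gradient norm {vanishing:.3e}")

# ReLU: negative inputs become zero, positive inputs pass through.
relu_out = torch.relu(torch.tensor([-2.0, -0.5, 0.0, 1.5]))
print(relu_out.tolist())  # [0.0, 0.0, 0.0, 1.5]
```

The two gradient norms differ by many orders of magnitude even though the architecture and data are the same, which is why the scale of the random weights matters so much.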

**Initializing weights with a fixed value**

Weights can also be initialized with a fixed value. A common value to start with is 0. As stated in a Machine Learning Mastery post, the network would not be able to update the weights easily in this case and the model would effectively become stuck. Two neurons in the hidden layer connected to the same input *must* have different weights; otherwise they will be updated in *the same* way and learning will not occur. This creates a big problem: whether the weights are all initialized to 0 or to 1, the network can't learn when every input shares the same constant weight.
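The symmetry problem is easy to see in a small experiment. The sketch below uses a made-up network and random data, and sets every parameter to the same constant (0.5 here, chosen arbitrarily so the gradients are not all zero). Each hidden neuron then receives exactly the same gradient, so the neurons can never become different from one another.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical toy network: 4 inputs, 3 hidden neurons, 1 output.
model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))
for param in model.parameters():
    nn.init.constant_(param, 0.5)  # every weight and bias gets the same value

x = torch.randn(8, 4)
target = torch.randn(8, 1)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# One row of gradients per hidden neuron, shape (3, 4).
hidden_grad = model[0].weight.grad
rows_identical = bool(
    torch.allclose(hidden_grad[0], hidden_grad[1])
    and torch.allclose(hidden_grad[1], hidden_grad[2])
)
print(rows_identical)  # True
```

Because the three rows match, a gradient step changes all three hidden neurons identically, and the hidden layer behaves like a single neuron no matter how long training runs.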

**The best way to initialize weights**

The best way to initialize weights is a modification of the first approach, random initialization. Randomly initializing the weights ensures that they are different, but it can still lead to exploding and vanishing gradients. Setting an upper and a lower bound on the randomly chosen weight values is the best way to avoid this. A common method for setting these bounds is to choose the weights in the range [-y, y], where y = 1/sqrt(n).

In this equation, n is the number of incoming connections to a neuron. This ensures that the weight values will be neither too high nor too low.
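As a worked example of this rule, the sketch below computes the bound for a layer with 100 inputs and draws the weights uniformly from [-y, y]. The function name `bounded_uniform_weights` and the layer sizes are assumptions made for illustration only.

```python
import math
import torch

def bounded_uniform_weights(n_out, n_in, seed=0):
    """Draw an (n_out, n_in) weight matrix uniformly from [-y, y],
    where y = 1 / sqrt(n_in)."""
    y = 1.0 / math.sqrt(n_in)
    gen = torch.Generator().manual_seed(seed)
    # torch.rand samples from [0, 1); rescale it to [-y, y).
    weights = (torch.rand(n_out, n_in, generator=gen) * 2 - 1) * y
    return weights, y

weights, y = bounded_uniform_weights(n_out=50, n_in=100)
print(y)  # 0.1, because sqrt(100) = 10
print(weights.abs().max().item() <= y + 1e-6)  # True
```

Note how the bound shrinks as the number of inputs grows: a neuron with 100 inputs gets weights in [-0.1, 0.1], while one with 10,000 inputs would get weights in [-0.01, 0.01].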

**Implementing with PyTorch**

By default, PyTorch initializes the neural network weights as bounded random values, as discussed in the third approach to weight initialization above. Taken from the PyTorch source code itself, here is how the weights are initialized in linear layers:

```python
stdv = 1. / math.sqrt(self.weight.size(1))
self.weight.data.uniform_(-stdv, stdv)
```

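The first line computes the bound from the number of inputs to the layer (`self.weight.size(1)` is the layer's input size), and the second line draws the weights from a uniform distribution between -stdv and stdv, where stdv plays the role of y above. More recent PyTorch versions express the same default through `kaiming_uniform_`, which yields the same bound of 1/sqrt(n) for the weights of a linear layer.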

I would suggest that for most applications of a neural network, this type of initialization does *not* need to be adjusted.
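If you want to check the default or apply the same rule explicitly, something along these lines should work. This is a sketch rather than the library's own code: `init_linear` is a helper name made up for this example, and the model is a placeholder. It relies on `nn.init.uniform_` and `Module.apply`, both part of PyTorch's public API.

```python
import math
import torch
from torch import nn

torch.manual_seed(0)

# Check that a freshly created layer respects the 1/sqrt(n) bound.
layer = nn.Linear(in_features=100, out_features=10)
bound = 1.0 / math.sqrt(layer.in_features)
default_within_bound = layer.weight.abs().max().item() <= bound + 1e-6
print(default_within_bound)

def init_linear(module):
    """Re-initialize linear layers uniformly in [-stdv, stdv]."""
    if isinstance(module, nn.Linear):
        stdv = 1.0 / math.sqrt(module.weight.size(1))
        nn.init.uniform_(module.weight, -stdv, stdv)
        if module.bias is not None:
            nn.init.uniform_(module.bias, -stdv, stdv)

# Placeholder model; apply() visits every submodule recursively.
model = nn.Sequential(nn.Linear(100, 32), nn.ReLU(), nn.Linear(32, 1))
model.apply(init_linear)

custom_within_bound = all(
    m.weight.abs().max().item() <= 1.0 / math.sqrt(m.in_features) + 1e-6
    for m in model.modules()
    if isinstance(m, nn.Linear)
)
print(custom_within_bound)
```

The same pattern works for other schemes: swap the body of `init_linear` for another function from `torch.nn.init`, such as `xavier_uniform_` or `kaiming_normal_`.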

**Concluding Remarks**

PyTorch is a powerful package for robust neural network computations. Key to these neural networks are the weights that are learned during training and used for future predictions. The way in which the weights are initialized is a very important factor in determining whether the model finds the global minimum of the objective function or gets caught in a local minimum.