Stochastic Gradient Descent with TensorFlow

Machine learning consists of machines learning patterns by minimizing the difference between what the model thinks the right answer is and the real right answer. Being able to minimize functions efficiently is therefore vital to machine learning. One way of doing this is stochastic gradient descent, which has been around since the 1950s, when it was proposed by the mathematicians Herbert Robbins and Sutton Monro.

To understand how stochastic gradient descent is implemented with TensorFlow, we will take a deep dive into its inner workings and see how it differs from the regular gradient descent algorithm. Finally, we will discuss how the algorithm can be applied with TensorFlow.

Deep Dive into Stochastic Gradient Descent

High Level

Think of a task that you are trying to teach a machine. You have a bunch of examples or patterns that you want it to learn from, and you want it to learn them as well as possible. To do this, the algorithm tries to minimize a function (often called the loss or objective function) that measures how wrong the machine’s answers are; the smaller the value, the better the machine has learned the patterns.

After each learning step, you want to evaluate how well you did and then take a step in the right direction to make sure you’ll do better next time – otherwise you are just shooting in the dark.

Lower Level

A more technical explanation of gradient descent is that the algorithm computes the gradient (a measure of steepness) of the function that you are looking to minimize. The gradient collects the partial derivatives of that function with respect to each model parameter; in a simple linear model there is one weight per feature/column in your dataset. It points in the direction of steepest increase, so stepping in the opposite direction decreases the function.

Regular (batch) gradient descent goes through every data row before each update, computing the steepness at the current point and how the parameters should change to reduce the overall objective function. That means calculating dot products for each sample, which can involve millions of computations per step depending on the dataset you are analyzing. Because it uses the whole dataset, each step follows the most direct downhill path, but is this the best way to find the minimum in practice? It turns out that the answer is no. In practice, we want speed.

Stochastic gradient descent (SGD) streamlines this. Stochastic means random: instead of using every row, SGD estimates the gradient from a randomly chosen example (or a small random batch). Each step is therefore a noisy approximation of the true downhill direction, but it still tells you how to adjust the weights and biases to move toward the lowest point in the trough.
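To make the cost of regular gradient descent concrete, here is a minimal sketch in plain Python. The toy dataset, the linear model y = w * x + b, the learning rate, and the step count are all assumptions chosen for illustration; the point is only that every single update loops over all rows.

```python
# Full-batch gradient descent on a toy linear model y = w * x + b.
# The data, learning rate, and step count are made-up illustration values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1


def batch_gradient(w, b, xs, ys):
    """Gradient of the mean squared error, averaged over EVERY row."""
    n = len(xs)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return grad_w, grad_b


def batch_gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0
    for _ in range(steps):
        # Each update touches the whole dataset before moving once.
        grad_w, grad_b = batch_gradient(w, b, xs, ys)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w, b = batch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))  # should land close to w = 2, b = 1
```

With five rows this is cheap, but the work per update grows with the number of rows, which is what makes the approach slow on large datasets.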

Stochastic Gradient Descent

One benefit of a stochastic process is that the randomness can help the optimizer escape shallow local minima, although this is not guaranteed.

Stochastic gradient descent gets around the cost of the full-batch computation by calculating the gradient of the cost function for just one example at a time. It is preferred because of the faster training times. As one article notes, there is more noise in the path to the minimum compared with batch gradient descent, but this is OK since we aren’t so concerned with the path (only the destination, deep I know).
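The one-example-at-a-time idea can be sketched in plain Python as follows. The toy data, the linear model y = w * x + b, the learning rate, the epoch count, and the fixed random seed are assumptions for illustration, not values from any particular library.

```python
import random

# Toy data generated from y = 2x + 1 (illustration values only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]


def sgd(xs, ys, learning_rate=0.02, epochs=500, seed=0):
    """Stochastic gradient descent: one randomly ordered example per update."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the rows in a fresh random order
        for i in indices:
            # Gradient of the squared error for this single example only.
            error = w * xs[i] + b - ys[i]
            w -= learning_rate * 2.0 * error * xs[i]
            b -= learning_rate * 2.0 * error
    return w, b


w_sgd, b_sgd = sgd(xs, ys)
print(round(w_sgd, 3), round(b_sgd, 3))  # should land close to w = 2, b = 1
```

Each update here costs the same no matter how large the dataset is, at the price of a noisier path to the minimum.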

Implementing with TensorFlow

Implementing gradient descent within TensorFlow is rather simple. Now that we understand how the full SGD method works, let’s see how we can put it into practice.

The command tensorflow.keras.optimizers.SGD() creates the optimizer. It has several parameters that need to be examined. I will investigate each of them here.

By default, these are the parameters that the optimizer is initialized with:

  • learning_rate=0.01
  • momentum=0.0
  • nesterov=False
  • name='SGD'
  • **kwargs

Allowed keyword arguments: clipnorm, clipvalue, lr, decay
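As a rough sketch of how the optimizer might be used, the snippet below fits a one-neuron Keras model to toy data. The data, the model, and the hyperparameter values are assumptions for illustration. Only arguments from the default list above are passed, since support for lr and decay appears to vary between TensorFlow versions; check the documentation for the version you have installed.

```python
import numpy as np
import tensorflow as tf

# Toy regression data generated from y = 3x + 2 (illustration values only).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 1)).astype("float32")
y = 3.0 * x + 2.0

# A single dense unit is enough to represent a straight line.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(1),
])

# Hyperparameter values here are assumptions, not recommendations.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=False)
model.compile(optimizer=optimizer, loss="mse")

# batch_size=32 means each update uses a small batch rather than every row.
history = model.fit(x, y, batch_size=32, epochs=20, verbose=0)
print(history.history["loss"][0], history.history["loss"][-1])
```

The loss at the end of training should be well below the loss after the first epoch.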

Decay rate

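Decay is an important parameter in function optimization. It determines how quickly the learning rate shrinks as training progresses, so that the optimizer takes large steps early on and smaller, more careful steps as it approaches the minimum.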
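One common way to shrink the learning rate is time-based (inverse) decay, sketched below in plain Python. This is an assumption about the schedule, shown only to illustrate the idea; the function name and values are hypothetical, and the exact behavior of the decay argument depends on your TensorFlow version.

```python
def time_based_decay(initial_learning_rate, decay, step):
    """Learning rate after `step` updates under time-based (inverse) decay."""
    return initial_learning_rate / (1.0 + decay * step)


# Hypothetical values: start at 0.01 and decay by 1e-3 per update.
for step in (0, 100, 1000):
    print(step, time_based_decay(0.01, 1e-3, step))
```

With decay set to 0 the learning rate never changes, and larger decay values make it fall off faster.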

Learning rate

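The learning rate is the step size: how far the weights move along the negative gradient at each update, and so how fast the function is optimized. Careful though, a learning rate that is too high isn’t necessarily good either, because the optimizer can overshoot the minimum and even diverge, while one that is too low makes training slow. Trial and error can find the best relationship between learning rate and accuracy.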
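A small experiment shows why the learning rate needs care. The sketch below runs plain gradient descent on the toy function f(w) = w**2; the function, the starting point, and the three learning rates are assumptions picked to show the slow, good, and diverging cases.

```python
def minimize_square(learning_rate, steps=50, start=1.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2.0 * w
    return w


too_low = minimize_square(0.001)   # barely moves away from the start
reasonable = minimize_square(0.1)  # ends up very close to the minimum at 0
too_high = minimize_square(1.1)    # overshoots further on every step

print(too_low, reasonable, too_high)
```

Only the middle setting gets near the minimum in 50 steps: the low rate is still far away, and the high rate ends up further from the minimum than where it started.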


Momentum

Momentum is a feature commonly included when training neural networks these days. It accelerates the updates when successive gradients keep pointing in the same direction. It does this by adding a fraction of the previous update to the current one at every iteration. You can think of this as accelerating a weight so that it arrives at the minimized loss more quickly than it would with a constant step size adjustment.
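The momentum update can be sketched in a few lines of plain Python. The helper names, the toy function f(w) = w**2, and the hyperparameter values are assumptions for illustration; the update follows the common form in which a velocity term carries over a fraction of the previous step.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.9):
    """One update: keep a fraction of the old velocity, then step downhill."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


def run(momentum, steps=100, learning_rate=0.01, start=1.0):
    """Minimize f(w) = w**2 (gradient 2 * w) and return the final weight."""
    w, velocity = start, 0.0
    for _ in range(steps):
        grad = 2.0 * w
        w, velocity = sgd_momentum_step(w, velocity, grad, learning_rate, momentum)
    return w


without_momentum = run(momentum=0.0)
with_momentum = run(momentum=0.9)
print(abs(without_momentum), abs(with_momentum))
```

With the same learning rate and number of steps, the run with momentum ends up closer to the minimum at 0 in this toy setting.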

There are some great full code examples on GitHub.

Concluding Remarks

Minimization of the objective function is one of the keys of machine learning algorithms. Gradient descent is used to intelligently adjust the weights and converge on the solution as quickly as possible, and stochastic gradient descent does so at a fraction of the computational cost per step.
