title: Gradient Descent
Gradient descent is an optimization algorithm for finding the minimum of a function. In deep learning this optimization algorithm is very useful when the parameters cannot be calculated analytically.
What you want to do is to repeatedly update the value of the parameter theta until you minimize the value of the Cost Function J(θ) close to 0;
The size of a step is called the learning rate. A larger learning rate make iterating faster but it might overshoot the global minimum, the value that we are looking for. On the other hand we could prevent this overshooting by decreasing the learning rate; but beware that the smaller you make the learning rate, the more computationally intensive it gets. This could either prolong the computation unnecessarily, or you may not arrive at the global minimum altogether.
A deep learning problem would require you to use multiple features for generating a predictive model. If for example if you are building a predictive model for house pricing, you would have to deal with features like the price itself, number of rooms, lot area, etc. And these features might extremely differ in range, like for example while the lot area might be between 0 and 2000 square feet, the other features like the number of rooms would be between 1 and 9.
This is where feature scaling, also called normalization, comes in handy, to make sure that your machine learning algorithm works properly.
Stochastic Gradient Descent
Machine learning problems usually requires computations over a sample size in the millions, and that could be very computationally intensive.
In stochastic gradient descent you update the parameter for the cost gradient of each example rather that the sum of the cost gradient of all the examples. You could arrive at a set of good parameters faster after only a few passes through the training examples, thus the learning is faster as well.
Mini-batch Gradient Descent
Stochastic Gradient Descent computes the gradient on a single instance. As described above, this is indeed faster than evaluating the entire training set when computing a gradient. However, this is somewhat extreme and can lead to parameter volatility around the optimal solution. Mini-batch gradient descent offers a middle-ground where the gradients are computed on a random sample (a “mini” batch). The larger your mini-batch (the chunkier your selection from the training set), the less parameter volatility you can expect around the optimal solution.