gradient descent

reference:
https://www.zhihu.com/question/36301367/answer/142096153
https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html

prerequisites
1. derivative (one variable only)

(1)    \begin{equation*} \frac{d}{dx}f(x_{0}) = f'(x_{0}) = \lim_{\Delta x\to 0} \frac{\Delta y}{\Delta x} = \lim_{\Delta x \to 0} \frac{f(x_{0}+\Delta x)-f(x_{0})}{\Delta x} \end{equation*}

it measures the sensitivity of f(x) to a small change in x (the slope of the tangent line)
(from wiki)
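
to make definition (1) concrete, here is a tiny numeric sketch that approximates the limit with a small but finite delta_x (a hypothetical example, not from the references):

# approximate f'(x0) for f(x) = x^2 at x0 = 3 with a small delta_x
def f(x):
    return x ** 2

x0, delta_x = 3.0, 1e-6
slope = (f(x0 + delta_x) - f(x0)) / delta_x   # forward difference
print(slope)   # close to 6.0, the analytic derivative 2 * x0
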
2. partial derivative (multiple variables, but only along a coordinate axis)
requires at least two variables.
it is simply the derivative taken along each coordinate direction (the x, y, z, ... axes), holding the other variables fixed

(2)    \begin{equation*} \frac{\partial}{\partial x_{i}}f(x_{1},\dots,x_{n}) = \lim_{h\to 0} \frac{f(x_{1},\dots,x_{i}+h,\dots,x_{n})-f(x_{1},\dots,x_{i},\dots,x_{n})}{h} \end{equation*}


3. directional derivative (multiple variables, any direction)
break the direction vector down into its components, so we can express the result with partial derivatives

it is just a dot product: [vector of partial derivatives] · [unit vector in the chosen direction]
it tells us how much each partial derivative contributes along that direction.

     \begin{align*} D_{\vec v}f(\vec x) = \nabla_{\vec v}f(\vec x) = \lim_{h \to 0} \frac{f(\vec x+h \vec v)-f(\vec x)}{h} = \sum_{n=1}^{m} v_{n}\frac{\partial f}{\partial x_{n}} \end{align*}

we only care about the direction, so the vector v is taken to be a unit vector
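
as a sanity check on the formula above (and on the partial-derivative definition in (2)), here is a small sketch that estimates the partial derivatives of an assumed function f(x, y) = x^2 + 3y by finite differences and compares the dot-product form with the limit definition along v:

# estimate the gradient of an assumed f(x, y) = x^2 + 3*y by finite differences,
# then compare the dot product with a unit vector v against the limit definition
import math

def f(x, y):
    return x ** 2 + 3 * y

def partials(x, y, h=1e-6):
    # finite-difference versions of the partial derivatives in (2)
    dfdx = (f(x + h, y) - f(x, y)) / h
    dfdy = (f(x, y + h) - f(x, y)) / h
    return dfdx, dfdy

x, y = 1.0, 2.0
vx, vy = 1 / math.sqrt(2), 1 / math.sqrt(2)      # unit vector v

dfdx, dfdy = partials(x, y)
dot_form = vx * dfdx + vy * dfdy                 # sum of v_n * df/dx_n

h = 1e-6
limit_form = (f(x + h * vx, y + h * vy) - f(x, y)) / h
print(dot_form, limit_form)                      # both about 3.5355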

so what is the gradient?
the gradient is about finding the direction of the steepest slope. in other words, we need the unit vector that maximizes [vector of partial derivatives] · [unit direction vector].
for a dot product of two vectors, the maximum is reached when they point in the same direction (theta = 0). and since we only care about the direction,
the gradient = [vector of partial derivatives]

In conclusion, if we want to minimize the cost function, we repeatedly move each parameter a small step against its partial derivative, i.e. along the negative gradient. This is gradient descent.
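
a minimal sketch of that idea, assuming a toy cost C(v1, v2) = v1^2 + v2^2 and a hand-picked learning rate alpha (illustration only, not from the references):

# minimal gradient descent on an assumed cost C(v1, v2) = v1^2 + v2^2
def grad_C(v1, v2):
    # analytic partial derivatives of C
    return 2 * v1, 2 * v2

alpha = 0.1           # small, positive learning rate
v1, v2 = 3.0, -4.0    # arbitrary starting point

for _ in range(100):
    g1, g2 = grad_C(v1, v2)
    v1, v2 = v1 - alpha * g1, v2 - alpha * g2    # step against the gradient

print(v1, v2)          # both very close to 0, the minimum of C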

another way to comprehend it.
 \Delta C \approx \nabla C \cdot \Delta v \;\;\;\;\; \nabla C \;(vector\; of\; partial\; derivatives)
if we want to minimize the cost, we should find a \Delta v that makes \Delta C negative.
suppose we choose \Delta v = - \alpha \nabla C , where alpha is a small, positive parameter.
then \Delta C \approx - \alpha ||\nabla C||^2
so the change in cost is negative. this is what we are looking for.
by repeatedly applying the update
 v \to v' = v - \alpha \nabla C
we can minimize the cost function
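
a quick numeric check of the relation \Delta C \approx - \alpha ||\nabla C||^2 , reusing the same assumed toy cost C(v1, v2) = v1^2 + v2^2 :

# check that the step delta_v = -alpha * grad C gives
# delta_C approximately equal to -alpha * ||grad C||^2
def C(v1, v2):
    return v1 ** 2 + v2 ** 2

def grad_C(v1, v2):
    return 2 * v1, 2 * v2

alpha = 0.001
v1, v2 = 3.0, -4.0
g1, g2 = grad_C(v1, v2)

delta_C = C(v1 - alpha * g1, v2 - alpha * g2) - C(v1, v2)
predicted = -alpha * (g1 ** 2 + g2 ** 2)
print(delta_C, predicted)   # both about -0.0999: the cost goes down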

coursera machine learning, unit one

definition
a computer program is said to learn from experience E with respect to task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

types
1. supervised learning
1.1 classification (mapping to a discrete label)
1.2 regression (mapping to a continuous number)
2. unsupervised learning (clustering unlabeled data)

supervised learning workflow (diagram from coursera)

how to measure the accuracy of the hypothesis (linear)
#linear regression cost function

     \begin{align*} J(\theta_{0},\theta_{1}) &= \frac{1}{2m}\sum_{i=1}^{m} (\hat{y}_{i}-y_{i})^2 = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i})^2 \\ \hat{y}_{i} &: \; \text{predicted value} \\ h_{\theta}(x) &: \; \text{the linear function formed by } \theta_{0},\theta_{1} \end{align*}

find the theta values that minimize the cost function. when the cost function equals 0, every data point lies exactly on the line.
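
a small sketch of computing J(\theta_{0},\theta_{1}) for made-up data points (the dataset and values here are assumptions, for illustration only):

# cost J(theta0, theta1) for linear regression, as defined above
def cost(theta0, theta1, xs, ys):
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        h = theta0 + theta1 * x       # hypothesis h_theta(x)
        total += (h - y) ** 2
    return total / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                   # these points lie exactly on y = 2x
print(cost(0.0, 2.0, xs, ys))          # 0.0 -> perfect fit
print(cost(0.0, 1.5, xs, ys))          # about 0.583 -> imperfect fit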

how to find the theta that minimizes the residuals
#gradient descent
why gradient descent works
repeat until convergence (simultaneously update all the theta) {

     \[ \theta_{i} := \theta_{i}-\alpha\frac{\partial}{\partial \theta_{i}} J(\theta_{0},\theta_{1}) \]

}
where i ∈ {0, 1}
#gradient descent for linear regression
repeat until convergence {

     \[\theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i}) \] \[\theta_{1} := \theta_{1} - \alpha\frac{1}{m}\sum_{i=1}^{m}((h_{\theta}(x_{i})-y_{i})*x_{i})\]

}
detail:
https://math.stackexchange.com/q/1695446
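
putting the two update rules together, here is a short sketch of batch gradient descent for one-variable linear regression (the data, learning rate, and iteration count are assumptions, not values from the course):

# batch gradient descent for one-variable linear regression
def gradient_descent(xs, ys, alpha=0.01, iterations=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # errors h_theta(x_i) - y_i under the current parameters
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]              # generated from y = 2x + 1
print(gradient_descent(xs, ys))         # approaches (1.0, 2.0)

the same loop generalizes to more features by adding one theta update per feature.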