back propagation

what is back propagation?
it’s gradient decent just with a different form.
Metaphor : divide features into layers, use different combination via weights to mimic the polynomial features.
calculate the gradient of the loss function respect to each weight by chain rule, consist of previous input layer and loss of  output in next layer,one layer at a time,

why we need back propagation?
Non-linear classification need to include lots of features. using back propagation can calculate gradient more quickly (dynamic programming).

how to ?
1.randomly initialize the weights
2.implement forward propagation to get H_{\theta}(x)

    \[ \begin{aligned} a1 &= X \\ z2 &= [ones(size(X,1),1) \quad a1] * W' \\ a2 &= \sigma(z2) \\ .... \\ a_{L} \end{aligned} \]

3.implement the cost function (K classification) with regulation
sum up the cost in multiple class of every training data

    \[ \begin{aligned} J(\theta) = - \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} [y_{k}^{(i)} log((h_{\theta}(x^{(i)}))_{k}) + \\ (1-y_{k}^{(i)})log(1-(h_{\theta}(x^{i}))_{k}) ] + \\ \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_{l}} \sum_{j=1}^{s_{l+1}} (\theta_{j,i}^{(l)})^{2} \end{aligned} \]

4.calculate output layer error, which is a-y

    \[ \begin{aligned} \sigma(z) &= \tfrac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1} \implies & \\ \tfrac{\partial \sigma}{\partial z} &= \sigma'(z) = (-1)(1 + e^{-z})^{-2} \tfrac{\partial}{\partial z}(1 + e^{-z}) \\ &= \cancel{(-1)}(1 + e^{-z})^{-2} (e^{-z}) \cancel{(-1)} \\ &= \left(\tfrac{1}{1 + e^{-z}}\right)\left(\tfrac{1 + e^{-z} -1}{1 + e^{-z}}\right) \\ &= \sigma(z) (1 - \sigma(z)) &\quad\square \\ a &= \sigma(z) \implies \tfrac{\partial a}{\partial z} = \sigma'(z) = a(1 - a) &\quad\square \\ J(\Theta) &= -[y \log(a) + (1-y) \log(1-a)] \implies & \\ \tfrac{\partial J}{\partial a} &= -[ y (\tfrac{1}{a}) + (1 - y) \tfrac{(-1)}{1 - a} ] \\ &= -[ (\tfrac{y}{a}) - ( \tfrac{1 - y}{1 - a} )] \\ &= - \tfrac{1}{a (1 - a)} [ (1 - a) y - a (1 - y) ] \\ &= - \tfrac{1}{a (1 - a)} [ y - \cancel{a y} - a + \cancel{a y} ] \\ &= \tfrac{1}{a (1 - a)} [ a - y ] &\quad\square \\ \delta^{(L)} &= \tfrac{\partial}{\partial z} J \\ &= \tfrac{\partial J}{\partial a} \tfrac{\partial a}{\partial z} \\ &= \tfrac{1}{a (1 - a)} [ a - y ] \ a (1 - a) \\ &=  a - y = a^{(L)} - y \end{aligned} \]

5.based on the output layer error , adjust weight on every layer.

    \[ \begin{aligned} \frac{\partial C}{\partial w} &= \frac{\partial z^{(l)}}{\partial w} * \frac{\partial a^{(l)}}{\partial z^{(l)}} * \frac{\partial C}{\partial a^{(l)}} \\ \frac{\partial z^{(l)}}{\partial w}  &= a^{(l-1)} \: input \: layer \: unit  \\ \frac{\partial a^{(l)}}{\partial z^{(l)}} &= \sigma'(z^{(l)}) = \sigma(z^{(l)}).*(1 - \sigma(z^{(l)})) \\ \frac{\partial C}{\partial a^{(l)}} &= \frac{\partial z^{l+1}}{\partial a^{(l)}} * \frac{\partial C}{\partial z^{(l+1)}} \end{aligned} \]

6.back to 2

ref : https://www.youtube.com/watch?v=ibJpTrp5mcE&list=PLZ9Enh6e3wIk9CkY7b-oqz8P-VIjLUU1h&index=5&t=0s

coursera marchine learining unit one

definition
a computer program is said to learn from experience E with respect to task T and some performance measure P, if its performance on T, as measured by P , improves with experience E.

type
1. supervised learning
1.1 classification (mapping to label, discrete)
1.2 regression (mapping to continuous number)
2.unsupervised learning (cluster data)

supervised learning workflow

(from coursera)

how to measure the accuracy of the hypothesis (linear)
#linear regression cost function

     \begin{equation} \[ J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m} (\hat{y}_{i}-y_{i})^2 =\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i})^2 \] \[ \hat{y} : \; predictive \;value \] \[ h_{\theta}(x) : \;linear function \;form \;by \; \theta_{0},\theta_{1} \] \end{equation}

find the most probable theta to minimize the cost function.when the cost function equal 0 means all the data plot lies in the line.

how to find the probable theta to minimize residual
#gradient descent
why gradient descent works
repeat until convergence (simultaneous update all the theta) {

     \[ \theta_{i} := \theta_{i}-\alpha\frac{\partial}{\partial \theta_{i}} J(\theta_{0},\theta_{1}) \]

}
where i = {0,1}
#gradient descent for linear regression
repeat until convergence {

     \[\theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i}) \] \[\theta_{1} := \theta_{1} - \alpha\frac{1}{m}\sum_{i=1}^{m}((h_{\theta}(x_{i})-y_{i})*x_{i})\]

}
detail:
https://math.stackexchange.com/q/1695446

naive bayes

it‘s all about conditional probability and combined probability.

bg2011082502
P(A|B)=P(AB) / P(B)
=> P(A|B)=P(A)P(B|A) / P(B)
=> P(A|B)=P(A)P(B|A) / (P(A)P(B|A)+P(A’)P(B|A’))

joint probability
assumption : events are independent .
E1 = p(s|w1)*p(s|w2)*p(s)
E2 = (1-p(s|w1))*(1-p(s|w2))*(1-p(s))

P(S|w1,w2) = P(S)P(S|w1)P(S|w2)   /  (  P(S)P(S|w1)P(S|w2) +  P(~S)P(~S|w1)P(~S|w2) )

 

bayesian inference

后验概率 = 先验概率 x 调整因子

 

usage example

1.spam mail filter

 

reference:

http://www.ruanyifeng.com/blog/2011/08/bayesian_inference_part_one.html

http://www.paulgraham.com/spam.html