coursera machine learning unit one

definition
a computer program is said to learn from experience E with respect to some task T and some performance measure P if its performance on T, as measured by P, improves with experience E.

types
1. supervised learning
1.1 classification (mapping to label, discrete)
1.2 regression (mapping to continuous number)
2. unsupervised learning (e.g. clustering unlabeled data)

supervised learning workflow

(from coursera)

how to measure the accuracy of the hypothesis (linear)
#linear regression cost function

     \[ J(\theta_{0},\theta_{1}) = \frac{1}{2m}\sum_{i=1}^{m} (\hat{y}_{i}-y_{i})^2 = \frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i})^2 \]
     \[ \hat{y}_{i} : \text{predicted value} \]
     \[ h_{\theta}(x) : \text{linear function formed by } \theta_{0},\theta_{1} \]

find the theta that minimizes the cost function. when the cost function equals 0, every data point lies exactly on the line.
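a minimal sketch of the cost function above in plain python. it assumes the usual single-variable hypothesis h(x) = theta0 + theta1 * x; the names `predict` and `cost` and the sample data are made up for illustration.

```python
def predict(theta0, theta1, x):
    # assumed hypothesis: a straight line h(x) = theta0 + theta1 * x
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    # J(theta0, theta1) = 1/(2m) * sum of squared prediction errors
    m = len(xs)
    total = sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# made-up data lying exactly on y = 2x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

print(cost(0.0, 2.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 1.0, xs, ys))  # worse line: (1 + 4 + 9) / 6, about 2.33
```

the second call shows that a line which misses the points gives a strictly positive cost, which is what the minimization is trying to drive down.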

how to find the theta that minimizes the cost function
#gradient descent
why gradient descent works
repeat until convergence (update all the theta simultaneously) {

     \[ \theta_{j} := \theta_{j}-\alpha\frac{\partial}{\partial \theta_{j}} J(\theta_{0},\theta_{1}) \]

}
where j = {0,1} indexes the parameters and \alpha is the learning rate (step size)
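a small sketch of what "simultaneous update" means in code: both partial derivatives are evaluated at the old theta values before either parameter is overwritten. the cost used here, J = theta0^2 + theta0*theta1 + theta1^2 - 3*theta0, is a made-up example (not from the course) whose minimum is at theta0 = 2, theta1 = -1; the helper names are hypothetical too.

```python
def gradient_step(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    # evaluate BOTH partial derivatives at the old values first ...
    g0 = dJ_dtheta0(theta0, theta1)
    g1 = dJ_dtheta1(theta0, theta1)
    # ... and only then update both parameters together
    return theta0 - alpha * g0, theta1 - alpha * g1


# partial derivatives of the made-up cost
# J = theta0^2 + theta0*theta1 + theta1^2 - 3*theta0
def dJ_dtheta0(t0, t1):
    return 2 * t0 + t1 - 3


def dJ_dtheta1(t0, t1):
    return t0 + 2 * t1


theta0, theta1 = 0.0, 0.0
for _ in range(500):
    theta0, theta1 = gradient_step(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha=0.1)

print(theta0, theta1)  # approaches the minimum at (2, -1)
```

updating theta0 first and then using the new theta0 inside the derivative for theta1 would be a different procedure; the tuple assignment keeps the two updates in step.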
#gradient descent for linear regression
repeat until convergence {

     \[ \theta_{0} := \theta_{0} - \alpha\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x_{i})-y_{i}) \]
     \[ \theta_{1} := \theta_{1} - \alpha\frac{1}{m}\sum_{i=1}^{m}\left((h_{\theta}(x_{i})-y_{i}) \cdot x_{i}\right) \]

}
detail:
https://math.stackexchange.com/q/1695446
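the two update rules for linear regression can be sketched in plain python as below. this assumes h(x) = theta0 + theta1 * x, a fixed number of iterations instead of a real convergence check, and made-up data; the function name, learning rate and iteration count are illustrative choices, not values from the course.

```python
def gradient_descent(xs, ys, alpha=0.1, iterations=5000):
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # prediction error h(x_i) - y_i for every training example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # the two sums from the update rules, each divided by m
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# made-up data generated from y = 1 + 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1)  # close to 1 and 2
```

if alpha is too large for the data the updates overshoot and diverge, so the learning rate usually has to be tuned (or the inputs rescaled) for a given data set.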