Lecture 3: Gradient Descent
Review: Gradient Descent
In gradient descent, the problem we want to solve is:
$$ \theta ^ { * } = \arg \min _ \theta L ( \theta ) $$
where
- $L$: loss function
- $\theta$: parameters
Suppose $\theta$ has two parameters $\{\theta_1, \theta_2\}$. We randomly pick a starting point $\theta ^ { 0 } = \left[ \begin{array} { l } { \theta _ { 1 } ^ { 0 } } \\ { \theta _ { 2 } ^ { 0 } } \end{array} \right]$, and the gradient is $\nabla L ( \theta ) = \left[ \begin{array} { l } { \partial L ( \theta ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ) / \partial \theta _ { 2 } } \end{array} \right]$. Each step then moves the parameters in the direction of the negative gradient:
$$ \begin{aligned} \left[ \begin{array} { l } { \theta _ { 1 } ^ { 1 } } \\ { \theta _ { 2 } ^ { 1 } } \end{array} \right] = \left[ \begin{array} { c } { \theta _ { 1 } ^ { 0 } } \\ { \theta _ { 2 } ^ { 0 } } \end{array} \right] - \eta \left[ \begin{array} { l } { \partial L ( \theta ^ { 0 } ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ^ { 0 } ) / \partial \theta _ { 2 } } \end{array} \right] \to \theta ^ { 1 } = \theta ^ { 0 } - \eta \nabla L \left( \theta ^ { 0 } \right) \\ \left[ \begin{array} { l } { \theta _ { 1 } ^ { 2 } } \\ { \theta _ { 2 } ^ { 2 } } \end{array} \right] = \left[ \begin{array} { c } { \theta _ { 1 } ^ { 1 } } \\ { \theta _ { 2 } ^ { 1 } } \end{array} \right] - \eta \left[ \begin{array} { l } { \partial L ( \theta ^ { 1 } ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ^ { 1 } ) / \partial \theta _ { 2 } } \end{array} \right] \to \theta ^ { 2 } = \theta ^ { 1 } - \eta \nabla L \left( \theta ^ { 1 } \right) \end{aligned} $$
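As a concrete illustration of these update equations, here is a minimal Python sketch (not part of the original lecture); the quadratic loss, starting point, learning rate, and iteration count are all assumptions chosen for demonstration.

```python
import numpy as np

# Hypothetical two-parameter loss L(theta) = (theta_1 - 3)^2 + 4 * (theta_2 + 1)^2
def loss(theta):
    return (theta[0] - 3.0) ** 2 + 4.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient [dL/dtheta_1, dL/dtheta_2]
    return np.array([2.0 * (theta[0] - 3.0), 8.0 * (theta[1] + 1.0)])

eta = 0.1                      # learning rate
theta = np.array([0.0, 0.0])   # theta^0, the starting point

for t in range(100):
    theta = theta - eta * grad(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, loss(theta))      # theta approaches [3, -1] and the loss approaches 0
```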
Tip 1: Tuning your learning rates
We have to set the learning rate $\eta$ carefully: if it is too small, training converges very slowly; if it is too large, the updates overshoot and the loss may oscillate or even blow up. Each update is
$$ \theta ^i = \theta ^{i - 1} - \eta \nabla L \left( \theta ^{i - 1} \right) $$
Adaptive Learning Rates
At the beginning we are usually far from the minimum, so we can start with a larger learning rate to speed things up, then gradually decay it so that we do not overshoot the loss minimum, e.g. $\eta^t = \eta / \sqrt{t + 1}$.
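A quick sanity check of this decay schedule as a sketch; the initial learning rate is an illustrative assumption.

```python
eta = 1.0                           # illustrative initial learning rate
for t in range(5):
    eta_t = eta / (t + 1) ** 0.5    # eta^t = eta / sqrt(t + 1)
    print(t, round(eta_t, 3))       # 1.0, 0.707, 0.577, 0.5, 0.447
```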
Adagrad
Adagrad is a common adaptive learning-rate method: divide the learning rate of each parameter by the root mean square of its previous derivatives.
- Vanilla Gradient Descent
$$w^{t + 1} \leftarrow w^t - \eta^t g^t$$
where
- $\eta ^ { t } = \frac \eta { \sqrt { t + 1 } }$
- $g ^ { t } = \frac { \partial L \left( \theta ^ { t } \right) } { \partial w }$
- Adagrad
$$ \begin{aligned} w^{t + 1} & \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t \\ \equiv w^{t + 1} & \leftarrow w^t - \frac{\frac \eta { \sqrt { t + 1 } }}{\sqrt { \frac { 1 } { t + 1 } \sum _ { i = 0 } ^ { t } \left( g ^i \right) ^ { 2 } }} g^t \\ \equiv w^{t + 1} & \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i = 0}^t (g^i)^2}} g^t \\ \end{aligned} $$
where
- $\sigma^t$: root mean square of the previous derivatives of parameter $w$

$g^t$ is the first derivative; the denominator uses the history of first derivatives to estimate the effect of the second derivative.
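A minimal sketch of the simplified Adagrad update above for a single parameter $w$; the one-dimensional loss, the initial value, $\eta$, and the small constant added for numerical safety are assumptions for illustration.

```python
import numpy as np

# Hypothetical one-dimensional loss L(w) = (w - 2)^2, with gradient g = 2 * (w - 2)
def grad(w):
    return 2.0 * (w - 2.0)

eta = 1.0
w = 0.0
sum_sq_grad = 0.0      # running sum of squared gradients, sum_i (g^i)^2
eps = 1e-8             # small constant to avoid division by zero (a common practical addition)

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    w -= eta / (np.sqrt(sum_sq_grad) + eps) * g   # w^{t+1} = w^t - eta / sqrt(sum (g^i)^2) * g^t

print(w)   # w moves toward the minimizer 2, with steps shrinking as gradients accumulate
```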
Tip 2: Stochastic Gradient Descent
- Gradient Descent
Loss is the summation over all training examples:
$$L = \sum_n \Big(\hat y^n - \Big(b + \sum_i w_i x_i^n \Big) \Big)^2$$
Parameter update:
$$\theta^i = \theta^{i - 1} - \eta \nabla L (\theta^{i - 1})$$
- Stochastic Gradient Descent
Pick an example $x^n$
Loss for only one example:
$$L^n = \Big(\hat y^n - \Big(b + \sum_i w_i x_i^n \Big) \Big)^2$$
Parameter update:
$$\theta ^i = \theta^{i - 1} - \eta \nabla L^n (\theta^{i - 1})$$
Stochastic gradient descent speeds up training. Vanilla gradient descent computes $L$ only after seeing all training examples and then updates the parameters once; stochastic gradient descent computes $L^n$ and updates the parameters every time it sees a single example $x^n$.
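A minimal sketch contrasting the two update schemes on a hypothetical linear-regression problem; the synthetic data, learning rate, and iteration counts are assumptions, and the batch version averages the gradient instead of summing it, a common variant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 training examples x^n with 2 features
y = X @ np.array([2.0, -3.0]) + 1.0      # targets from a hypothetical true model (b = 1)
eta = 0.05

# (Batch) gradient descent: look at all examples, then update once per step
w, b = np.zeros(2), 0.0
for step in range(200):
    err = y - (X @ w + b)                # residual y_hat^n - (b + sum_i w_i x_i^n) for every n
    w += eta * 2 * X.T @ err / len(X)    # gradient averaged over the dataset (a common variant)
    b += eta * 2 * err.mean()

# Stochastic gradient descent: update after every single example x^n
w_sgd, b_sgd = np.zeros(2), 0.0
for epoch in range(10):
    for n in rng.permutation(len(X)):
        err_n = y[n] - (X[n] @ w_sgd + b_sgd)   # residual for example n only
        w_sgd += eta * 2 * err_n * X[n]
        b_sgd += eta * 2 * err_n

print(w, b)               # both runs should end up close to [2, -3] and 1
print(w_sgd, b_sgd)
```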
Tip 3: Feature Scaling
For example, suppose our model is $y = b + w_1x_1 + w_2x_2$.
Scaling the features so that every dimension takes values in a similar range ensures that no feature's "contribution" dominates simply because its raw values are much larger than the others'.
If the ranges of different features differ too much, the loss contours are elongated ellipses, and the negative-gradient direction (the parameter-update direction) does not point toward the loss minimum. After feature scaling, all features cover a similar range, the contours become close to circles, and the updates head toward the minimum much more directly.
Concretely, for the $R$ examples $x^1, x^2, x^3, \dots, x^r, \dots, x^R$, standardize each dimension $i$: compute the mean $m_i$ and standard deviation $\sigma_i$ of dimension $i$ over all $R$ examples, then set
$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$
so that afterwards every dimension has mean 0 and variance 1.
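A minimal sketch of this standardization in NumPy; the data matrix is an assumption for illustration.

```python
import numpy as np

# Hypothetical data: R = 200 examples, 2 features with very different ranges
rng = np.random.default_rng(0)
X = np.stack([rng.normal(0, 1, 200),              # x_1 in a small range
              rng.normal(100, 20, 200)], axis=1)  # x_2 in a much larger range

mean = X.mean(axis=0)          # m_i: mean of each dimension i over the R examples
std = X.std(axis=0)            # sigma_i: standard deviation of each dimension i
X_scaled = (X - mean) / std    # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))   # each dimension now has mean ~0
print(X_scaled.std(axis=0))    # and standard deviation ~1
```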
Theory
Next: why does gradient descent work at all?
The problem we want to solve is:
$$ \theta ^ { * } = \arg \min _ { \theta } L ( \theta ) \quad \text { by gradient descent } $$
Consider whether the following statement is correct:
every time we update the parameters, we obtain a new $\theta$ that makes $L(\theta)$ smaller, i.e.
$$L \left( \theta ^ { 0 } \right) > L \left( \theta ^ { 1 } \right) > L \left( \theta ^ { 2 } \right) > \cdots$$
Formal Derivation
Suppose $\theta$ has two components $\{\theta_1, \theta_2\}$. Given a point, how can we easily find the nearby point that minimizes $L$ within a small neighborhood?
Taylor Series
Single-variable
Let $h(x)$ be any function infinitely differentiable around $x = x_0$.
$$ \begin{aligned} { h } ( { x } ) & = \sum _ { k = 0 } ^ { \infty } \frac { { h } ^ { ( k ) } \left( x _ { 0 } \right) } { k ! } \left( x - x _ { 0 } \right) ^ { k } \\ & = h \left( x _ { 0 } \right) + h ^ { \prime } \left( x _ { 0 } \right) \left( x - x _ { 0 } \right) + \frac { h ^ { \prime \prime } \left( x _ { 0 } \right) } { 2 ! } \left( x - x _ { 0 } \right) ^ { 2 } + \cdots \end{aligned} $$
When $x$ is close to $x_0$: $h(x) \approx h(x_0) + h'(x_0)(x - x_0)$
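A small numerical check of this first-order approximation; the function $h$, the expansion point, and the sample values of $x$ are assumptions for illustration.

```python
import numpy as np

h = np.sin                     # an arbitrary smooth function
h_prime = np.cos               # its derivative
x0 = 1.0

for x in [1.5, 1.1, 1.01]:     # moving closer and closer to x0
    approx = h(x0) + h_prime(x0) * (x - x0)      # h(x0) + h'(x0)(x - x0)
    print(x, h(x), approx, abs(h(x) - approx))   # the error shrinks roughly like (x - x0)^2
```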
Multi-variable
Let $h(x)$ be any function infinitely differentiable around $x = x_0$ and $y = y_0$.
$$ h ( x , y ) = h \left( x _ { 0 } , y _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial x } \left( x - x _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial y } \left( y - y _ { 0 } \right) + \cdots $$
When $(x, y)$ is close to $(x_0, y_0)$: $h ( x , y ) \approx h \left( x _ { 0 } , y _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial x } \left( x - x _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial y } \left( y - y _ { 0 } \right)$
Back to Formal Derivation
Based on the Taylor series: if the red circle centered at $(a, b)$ is small enough (i.e., $\theta_1$ and $\theta_2$ are close to $a$ and $b$), then inside the red circle
$$ \begin{aligned} { L } ( \theta ) & \approx { L } ( a , b ) + \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } \left( \theta _ { 1 } - a \right) + \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } } \left( \theta _ { 2 } - b \right) \\ & = s + u \left( \theta _ { 1 } - a \right) + v \left( \theta _ { 2 } - b \right) \end{aligned} $$
where
- $s = L(a, b)$
- $u = \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } , v = \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } }$
Goal: Find $\theta_1$ and $\theta_2$ in the red circle minimizing $L(\theta)$
Being inside the red circle means satisfying the equation of a circle centered at $(a, b)$ with radius $d$:
$$ \begin{aligned} \left( \theta _ { 1 } - a \right) ^ { 2 } + \left( \theta _ { 2 } - b \right) ^ { 2 } & \le d ^ { 2 } \\ (\Delta \theta_1)^2 + (\Delta \theta_2)^2 & \le d^2 \end{aligned} $$
To minimize the linear approximation $s + u\,\Delta\theta_1 + v\,\Delta\theta_2$ inside the circle, choose $(\Delta\theta_1, \Delta\theta_2)$ to point in the direction opposite to the gradient $(u, v)$ and make it as long as the radius allows; writing the step length as a learning rate $\eta$ gives
$$ \begin{aligned} \left[ \begin{array} { l } { \Delta \theta _ { 1 } } \\ { \Delta \theta _ { 2 } } \end{array} \right] & = - \eta \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] \\ \Rightarrow \left[ \begin{array} { l } { \theta _ { 1 } } \\ { \theta _ { 2 } } \end{array} \right] & = \left[ \begin{array} { l } { a } \\ { b } \end{array} \right] - \eta \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] \\ & = \left[ \begin{array} { l } { a } \\ { b } \end{array} \right] - \eta \left[ \begin{array} { c } { \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } } \\ { \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } } } \end{array} \right] \end{aligned} $$
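This also answers the earlier question about whether $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$ always holds: the derivation relies on the red circle being small, i.e. on $\eta$ being small enough for the linear approximation to be accurate. A minimal numeric sketch of this caveat; the loss function and step sizes are assumptions for illustration.

```python
import numpy as np

# Hypothetical loss with very different curvature along the two axes
def L(theta):
    return theta[0] ** 2 + 100.0 * theta[1] ** 2

def grad(theta):
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
for eta in [0.001, 0.005, 0.02]:              # small vs. too-large step sizes
    theta1 = theta0 - eta * grad(theta0)      # one gradient-descent step
    print(eta, L(theta0), L(theta1))          # the loss decreases only for the smaller etas
```

With the largest step size the update overshoots along the steep direction and the loss increases, so the chain of inequalities above only holds when $\eta$ is chosen small enough.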