Lecture 3: Gradient Descent
Review: Gradient Descent
In gradient descent, the problem we want to solve is:
$$ \theta ^ { * } = \arg \min _ \theta L ( \theta ) $$
where
- $L$: loss function
- $\theta$: parameters
Suppose $\theta$ has two parameters $\{\theta_1, \theta_2\}$. We randomly pick a starting point $\theta ^ { 0 } = \left[ \begin{array} { l } { \theta _ { 1 } ^ { 0 } } \\ { \theta _ { 2 } ^ { 0 } } \end{array} \right]$, and the gradient is $\nabla L ( \theta ) = \left[ \begin{array} { l } { \partial L ( \theta ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ) / \partial \theta _ { 2 } } \end{array} \right]$. Each step then moves the parameters in the direction of the negative gradient:
$$ \begin{aligned} \left[ \begin{array} { l } { \theta _ { 1 } ^ { 1 } } \\ { \theta _ { 2 } ^ { 1 } } \end{array} \right] = \left[ \begin{array} { c } { \theta _ { 1 } ^ { 0 } } \\ { \theta _ { 2 } ^ { 0 } } \end{array} \right] - \eta \left[ \begin{array} { l } { \partial L ( \theta ^ { 0 } ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ^ { 0 } ) / \partial \theta _ { 2 } } \end{array} \right] \to \theta ^ { 1 } = \theta ^ { 0 } - \eta \nabla L \left( \theta ^ { 0 } \right) \\ \left[ \begin{array} { l } { \theta _ { 1 } ^ { 2 } } \\ { \theta _ { 2 } ^ { 2 } } \end{array} \right] = \left[ \begin{array} { c } { \theta _ { 1 } ^ { 1 } } \\ { \theta _ { 2 } ^ { 1 } } \end{array} \right] - \eta \left[ \begin{array} { l } { \partial L ( \theta ^ { 1 } ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ^ { 1 } ) / \partial \theta _ { 2 } } \end{array} \right] \to \theta ^ { 2 } = \theta ^ { 1 } - \eta \nabla L \left( \theta ^ { 1 } \right) \end{aligned} $$
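As a concrete illustration of these update equations, here is a minimal Python sketch (not part of the original lecture); the quadratic loss, starting point, learning rate, and iteration count are all assumptions chosen for demonstration.

```python
import numpy as np

# Hypothetical two-parameter loss L(theta) = (theta_1 - 3)^2 + 4 * (theta_2 + 1)^2
def loss(theta):
    return (theta[0] - 3.0) ** 2 + 4.0 * (theta[1] + 1.0) ** 2

def grad(theta):
    # Analytic gradient [dL/dtheta_1, dL/dtheta_2]
    return np.array([2.0 * (theta[0] - 3.0), 8.0 * (theta[1] + 1.0)])

eta = 0.1                      # learning rate
theta = np.array([0.0, 0.0])   # theta^0, the starting point

for t in range(100):
    theta = theta - eta * grad(theta)   # theta^{t+1} = theta^t - eta * grad L(theta^t)

print(theta, loss(theta))      # theta approaches [3, -1] and the loss approaches 0
```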
Tip 1: Tuning your learning rates
We have to set the learning rate $\eta$ carefully: if it is too small, training converges very slowly; if it is too large, the updates overshoot and the loss may oscillate or even blow up. Each update is
$$ \theta ^i = \theta ^{i - 1} - \eta \nabla L \left( \theta ^{i - 1} \right) $$
Adaptive Learning Rates
At the beginning we are usually far from the minimum, so we can start with a larger learning rate to speed things up, then gradually decay it so that we do not overshoot the loss minimum, e.g. $\eta^t = \eta / \sqrt{t + 1}$.
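A quick sanity check of this decay schedule as a sketch; the initial learning rate is an illustrative assumption.

```python
eta = 1.0                           # illustrative initial learning rate
for t in range(5):
    eta_t = eta / (t + 1) ** 0.5    # eta^t = eta / sqrt(t + 1)
    print(t, round(eta_t, 3))       # 1.0, 0.707, 0.577, 0.5, 0.447
```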
Adagrad
Adagrad is a common adaptive learning-rate method: divide the learning rate of each parameter by the root mean square of its previous derivatives.
- Vanilla Gradient Descent
$$w^{t + 1} \leftarrow w^t - \eta^t g^t$$
where
- $\eta ^ { t } = \frac \eta { \sqrt { t + 1 } }$
- $g ^ { t } = \frac { \partial L \left( \theta ^ { t } \right) } { \partial w }$
- Adagrad
$$ \begin{aligned} w^{t + 1} & \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t \\ \equiv w^{t + 1} & \leftarrow w^t - \frac{\frac \eta { \sqrt { t + 1 } }}{\sqrt { \frac { 1 } { t + 1 } \sum _ { i = 0 } ^ { t } \left( g ^i \right) ^ { 2 } }} g^t \\ \equiv w^{t + 1} & \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i = 0}^t (g^i)^2}} g^t \\ \end{aligned} $$
where
- $\sigma^t$: root mean square of the previous derivatives of parameter $w$

$g^t$ is the first derivative; the denominator uses the history of first derivatives to estimate the effect of the second derivative.
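A minimal sketch of the simplified Adagrad update above for a single parameter $w$; the one-dimensional loss, the initial value, $\eta$, and the small constant added for numerical safety are assumptions for illustration.

```python
import numpy as np

# Hypothetical one-dimensional loss L(w) = (w - 2)^2, with gradient g = 2 * (w - 2)
def grad(w):
    return 2.0 * (w - 2.0)

eta = 1.0
w = 0.0
sum_sq_grad = 0.0      # running sum of squared gradients, sum_i (g^i)^2
eps = 1e-8             # small constant to avoid division by zero (a common practical addition)

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    w -= eta / (np.sqrt(sum_sq_grad) + eps) * g   # w^{t+1} = w^t - eta / sqrt(sum (g^i)^2) * g^t

print(w)   # w moves toward the minimizer 2, with steps shrinking as gradients accumulate
```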
Tip 2: Stochastic Gradient Descent
- Gradient Descent
Loss is the summation over all training examples:
$$L = \sum_n \Big(\hat y^n - \Big(b + \sum_i w_i x_i^n \Big) \Big)^2$$
Parameter update:
$$\theta^i = \theta^{i - 1} - \eta \nabla L (\theta^{i - 1})$$
- Stochastic Gradient Descent
Pick an example $x^n$
Loss for only one example:
$$L^n = \Big(\hat y^n - \Big(b + \sum_i w_i x_i^n \Big) \Big)^2$$
Parameter update:
$$\theta ^i = \theta^{i - 1} - \eta \nabla L^n (\theta^{i - 1})$$
Stochastic gradient descent speeds up training. Vanilla gradient descent computes $L$ only after seeing all training examples and then updates the parameters once; stochastic gradient descent computes $L^n$ and updates the parameters every time it sees a single example $x^n$.
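A minimal sketch contrasting the two update schemes on a hypothetical linear-regression problem; the synthetic data, learning rate, and iteration counts are assumptions, and the batch version averages the gradient instead of summing it, a common variant.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))            # 100 training examples x^n with 2 features
y = X @ np.array([2.0, -3.0]) + 1.0      # targets from a hypothetical true model (b = 1)
eta = 0.05

# (Batch) gradient descent: look at all examples, then update once per step
w, b = np.zeros(2), 0.0
for step in range(200):
    err = y - (X @ w + b)                # residual y_hat^n - (b + sum_i w_i x_i^n) for every n
    w += eta * 2 * X.T @ err / len(X)    # gradient averaged over the dataset (a common variant)
    b += eta * 2 * err.mean()

# Stochastic gradient descent: update after every single example x^n
w_sgd, b_sgd = np.zeros(2), 0.0
for epoch in range(10):
    for n in rng.permutation(len(X)):
        err_n = y[n] - (X[n] @ w_sgd + b_sgd)   # residual for example n only
        w_sgd += eta * 2 * err_n * X[n]
        b_sgd += eta * 2 * err_n

print(w, b)               # both runs should end up close to [2, -3] and 1
print(w_sgd, b_sgd)
```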
Tip 3: Feature Scaling
For example, suppose our model is $y = b + w_1x_1 + w_2x_2$.
Scaling the features so that every dimension takes values in a similar range ensures that no feature's "contribution" dominates simply because its raw values are much larger than the others'.
If the ranges of different features differ too much, the loss contours are elongated ellipses, and the negative-gradient direction (the parameter-update direction) does not point toward the loss minimum. After feature scaling, all features cover a similar range, the contours become close to circles, and the updates head toward the minimum much more directly.
Concretely, for the $R$ examples $x^1, x^2, x^3, \dots, x^r, \dots, x^R$, standardize each dimension $i$: compute the mean $m_i$ and standard deviation $\sigma_i$ of dimension $i$ over all $R$ examples, then set
$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$
so that afterwards every dimension has mean 0 and variance 1.
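A minimal sketch of this standardization in NumPy; the data matrix is an assumption for illustration.

```python
import numpy as np

# Hypothetical data: R = 200 examples, 2 features with very different ranges
rng = np.random.default_rng(0)
X = np.stack([rng.normal(0, 1, 200),              # x_1 in a small range
              rng.normal(100, 20, 200)], axis=1)  # x_2 in a much larger range

mean = X.mean(axis=0)          # m_i: mean of each dimension i over the R examples
std = X.std(axis=0)            # sigma_i: standard deviation of each dimension i
X_scaled = (X - mean) / std    # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))   # each dimension now has mean ~0
print(X_scaled.std(axis=0))    # and standard deviation ~1
```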
Theory
Next: why does gradient descent work at all?
The problem we want to solve is:
$$ \theta ^ { * } = \arg \min _ { \theta } L ( \theta ) \quad \text { by gradient descent } $$
Consider whether the following statement is correct:
every time we update the parameters, we obtain a new $\theta$ that makes $L(\theta)$ smaller, i.e.
$$L \left( \theta ^ { 0 } \right) > L \left( \theta ^ { 1 } \right) > L \left( \theta ^ { 2 } \right) > \cdots$$
Formal Derivation
Suppose $\theta$ has two components $\{\theta_1, \theta_2\}$. Given a point, how can we easily find the nearby point that minimizes $L$ within a small neighborhood?
Taylor Series
Single-variable
Let $h(x)$ be any function infinitely differentiable around $x = x_0$.
$$ \begin{aligned} { h } ( { x } ) & = \sum _ { k = 0 } ^ { \infty } \frac { { h } ^ { ( k ) } \left( x _ { 0 } \right) } { k ! } \left( x - x _ { 0 } \right) ^ { k } \\ & = h \left( x _ { 0 } \right) + h ^ { \prime } \left( x _ { 0 } \right) \left( x - x _ { 0 } \right) + \frac { h ^ { \prime \prime } \left( x _ { 0 } \right) } { 2 ! } \left( x - x _ { 0 } \right) ^ { 2 } + \cdots \end{aligned} $$
When $x$ is close to $x_0$: $h(x) \approx h(x_0) + h'(x_0)(x - x_0)$
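A small numerical check of this first-order approximation; the function $h$, the expansion point, and the sample values of $x$ are assumptions for illustration.

```python
import numpy as np

h = np.sin                     # an arbitrary smooth function
h_prime = np.cos               # its derivative
x0 = 1.0

for x in [1.5, 1.1, 1.01]:     # moving closer and closer to x0
    approx = h(x0) + h_prime(x0) * (x - x0)      # h(x0) + h'(x0)(x - x0)
    print(x, h(x), approx, abs(h(x) - approx))   # the error shrinks roughly like (x - x0)^2
```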
Multi-variable
Let $h(x)$ be any function infinitely differentiable around $x = x_0$ and $y = y_0$.
$$ h ( x , y ) = h \left( x _ { 0 } , y _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial x } \left( x - x _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial y } \left( y - y _ { 0 } \right) + \cdots $$
When $(x, y)$ is close to $(x_0, y_0)$: $h ( x , y ) \approx h \left( x _ { 0 } , y _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial x } \left( x - x _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial y } \left( y - y _ { 0 } \right)$
Back to Formal Derivation
Based on the Taylor series: if the red circle centered at $(a, b)$ is small enough (i.e., $\theta_1$ and $\theta_2$ are close to $a$ and $b$), then inside the red circle
$$ \begin{aligned} { L } ( \theta ) & \approx { L } ( a , b ) + \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } \left( \theta _ { 1 } - a \right) + \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } } \left( \theta _ { 2 } - b \right) \\ & = s + u \left( \theta _ { 1 } - a \right) + v \left( \theta _ { 2 } - b \right) \end{aligned} $$
where
- $s = L(a, b)$
- $u = \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } , v = \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } }$
Goal: Find $\theta_1$ and $\theta_2$ in the red circle minimizing $L(\theta)$
Being inside the red circle means satisfying the equation of a circle centered at $(a, b)$ with radius $d$:
$$ \begin{aligned} \left( \theta _ { 1 } - a \right) ^ { 2 } + \left( \theta _ { 2 } - b \right) ^ { 2 } & \le d ^ { 2 } \\ (\Delta \theta_1)^2 + (\Delta \theta_2)^2 & \le d^2 \end{aligned} $$
To minimize the linear approximation $s + u\,\Delta\theta_1 + v\,\Delta\theta_2$ inside the circle, choose $(\Delta\theta_1, \Delta\theta_2)$ to point in the direction opposite to the gradient $(u, v)$ and make it as long as the radius allows; writing the step length as a learning rate $\eta$ gives
$$ \begin{aligned} \left[ \begin{array} { l } { \Delta \theta _ { 1 } } \\ { \Delta \theta _ { 2 } } \end{array} \right] & = - \eta \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] \\ \Rightarrow \left[ \begin{array} { l } { \theta _ { 1 } } \\ { \theta _ { 2 } } \end{array} \right] & = \left[ \begin{array} { l } { a } \\ { b } \end{array} \right] - \eta \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] \\ & = \left[ \begin{array} { l } { a } \\ { b } \end{array} \right] - \eta \left[ \begin{array} { c } { \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } } \\ { \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } } } \end{array} \right] \end{aligned} $$
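This also answers the earlier question about whether $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$ always holds: the derivation relies on the red circle being small, i.e. on $\eta$ being small enough for the linear approximation to be accurate. A minimal numeric sketch of this caveat; the loss function and step sizes are assumptions for illustration.

```python
import numpy as np

# Hypothetical loss with very different curvature along the two axes
def L(theta):
    return theta[0] ** 2 + 100.0 * theta[1] ** 2

def grad(theta):
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
for eta in [0.001, 0.005, 0.02]:              # small vs. too-large step sizes
    theta1 = theta0 - eta * grad(theta0)      # one gradient-descent step
    print(eta, L(theta0), L(theta1))          # the loss decreases only for the smaller etas
```

With the largest step size the update overshoots along the steep direction and the loss increases, so the chain of inequalities above only holds when $\eta$ is chosen small enough.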