
Lecture 3: Gradient Descent

Review: Gradient Descent

In gradient descent, the problem we want to solve is the following:

$$ \theta ^ { * } = \arg \min _ \theta L ( \theta ) $$

where,

  • $L$: loss function
  • $\theta$: parameters

Suppose $\theta$ has two parameters $\{\theta_1, \theta_2\}$. We randomly pick a starting point $\theta ^ { 0 } = \left[ \begin{array} { l } { \theta _ { 1 } ^ { 0 } } \\ { \theta _ { 2 } ^ { 0 } } \end{array} \right]$, and the gradient is $\nabla L ( \theta ) = \left[ \begin{array} { l } { \partial L ( \theta ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ) / \partial \theta _ { 2 } } \end{array} \right]$. The parameters are then updated iteratively:

$$ \begin{aligned} \left[ \begin{array} { l } { \theta _ { 1 } ^ { 1 } } \\ { \theta _ { 2 } ^ { 1 } } \end{array} \right] = \left[ \begin{array} { c } { \theta _ { 1 } ^ { 0 } } \\ { \theta _ { 2 } ^ { 0 } } \end{array} \right] - \eta \left[ \begin{array} { l } { \partial L ( \theta ^ { 0 } ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ^ { 0 } ) / \partial \theta _ { 2 } } \end{array} \right] \to \theta ^ { 1 } = \theta ^ { 0 } - \eta \nabla L \left( \theta ^ { 0 } \right) \\ \left[ \begin{array} { l } { \theta _ { 1 } ^ { 2 } } \\ { \theta _ { 2 } ^ { 2 } } \end{array} \right] = \left[ \begin{array} { c } { \theta _ { 1 } ^ { 1 } } \\ { \theta _ { 2 } ^ { 1 } } \end{array} \right] - \eta \left[ \begin{array} { l } { \partial L ( \theta ^ { 1 } ) / \partial \theta _ { 1 } } \\ { \partial L ( \theta ^ { 1 } ) / \partial \theta _ { 2 } } \end{array} \right] \to \theta ^ { 2 } = \theta ^ { 1 } - \eta \nabla L \left( \theta ^ { 1 } \right) \end{aligned} $$
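
As a quick illustration, the update rule can be written in a few lines of NumPy. The quadratic loss below is a hypothetical stand-in for $L$ (not from the lecture), chosen only because its gradient has an obvious closed form.

```python
import numpy as np

# Hypothetical two-parameter loss L(theta) = (theta_1 - 1)^2 + (theta_2 + 2)^2,
# standing in for the lecture's L; its gradient is easy to write down.
def grad_L(theta):
    return np.array([2.0 * (theta[0] - 1.0), 2.0 * (theta[1] + 2.0)])

eta = 0.1                        # learning rate
theta = np.array([0.0, 0.0])     # theta^0: a starting point (fixed here for reproducibility)

for t in range(100):
    # theta^{t+1} = theta^t - eta * grad L(theta^t)
    theta = theta - eta * grad_L(theta)

print(theta)                     # approaches the minimizer [1, -2]
```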

Tip 1: Tuning your learning rates

We must set our learning rate $\eta$ carefully.

$$ \theta ^i = \theta ^{i - 1} - \eta \nabla L \left( \theta ^{i - 1} \right) $$

Adaptive Learning Rates

Since training progresses slowly at the beginning if the learning rate is small, we can start with a larger learning rate to speed things up and then gradually decay it, so that we do not overshoot and miss the minimum of the loss, for example: $\eta^t = \eta / \sqrt{t + 1}$.
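
A minimal sketch of this decay schedule on a hypothetical 1-D loss $L(w) = w^2$ (a toy example of mine, not from the slides):

```python
import numpy as np

eta = 0.8                 # initial learning rate
w = 5.0                   # starting point for the toy loss L(w) = w^2

for t in range(10):
    eta_t = eta / np.sqrt(t + 1)   # eta^t = eta / sqrt(t + 1): large at first, then decaying
    w = w - eta_t * 2.0 * w        # gradient of w^2 is 2w
    print(t, round(eta_t, 3), round(w, 4))
```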

Adagrad

Adagrad is a common adaptive method: divide the learning rate of each parameter by the root mean square of its previous derivatives.

  • Vanilla Gradient Descent

    $$w^{t + 1} \leftarrow w^t - \eta^t g^t$$

    where,

    • $\eta ^ { t } = \frac \eta { \sqrt { t + 1 } }$
    • $g ^ { t } = \frac { \partial L \left( \theta ^ { t } \right) } { \partial w }$
  • Adagrad

    $$ \begin{aligned} w^{t + 1} & \leftarrow w^t - \frac{\eta^t}{\sigma^t} g^t \\ & \leftarrow w^t - \frac{\eta / \sqrt{t + 1}}{\sqrt{\frac{1}{t + 1} \sum_{i = 0}^{t} \left( g^i \right)^2}} g^t \\ & \leftarrow w^t - \frac{\eta}{\sqrt{\sum_{i = 0}^t \left( g^i \right)^2}} g^t \end{aligned} $$

    where,

    • $\sigma^t$: root mean square of the previous derivatives of parameter $w$

Here $g^t$ is the first derivative, while the denominator uses the history of first derivatives to estimate the magnitude of the second derivative.
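
A minimal sketch of the Adagrad update for a single parameter, using a hypothetical 1-D loss $L(w) = (w - 3)^2$ so that $g^t$ has a closed form (the small `eps` term is an implementation convenience, not part of the formula above):

```python
import numpy as np

def grad(w):
    # g^t = dL/dw for the toy loss L(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

eta = 1.0
w = 0.0
sum_sq_grad = 0.0      # running sum of (g^i)^2 over all past iterations
eps = 1e-8             # avoids division by zero in the denominator

for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    # w^{t+1} = w^t - eta / sqrt(sum_i (g^i)^2) * g^t
    w = w - eta / (np.sqrt(sum_sq_grad) + eps) * g

print(w)               # moves toward the minimizer w = 3
```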

Tip 2: Stochastic Gradient Descent

  • Gradient Descent

    Loss is the summation over all training examples:

    $$L = \sum_n \Big(\hat y^n - \Big(b + \sum w_i x_i^n \Big) \Big)^2$$

    Parameter update:

    $$\theta^i = \theta^{i - 1} - \eta \nabla L (\theta^{i - 1})$$

  • Stochastic Gradient Descent

    Pick an example $x^n$

    Loss for only one example:

    $$L^n = \Big(\hat y^n - \Big(b + \sum w_i x_i^n \Big) \Big)^2$$

    Parameter update:

    $$\theta ^i = \theta^{i - 1} - \eta \nabla L^n (\theta^{i - 1})$$

Stochastic gradient descent speeds up training. Ordinary gradient descent computes $L$ only after seeing all training examples and then updates the parameters once, whereas stochastic gradient descent computes $L^n$ and updates the parameters every time it sees a single example $n$.
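
A hedged sketch contrasting the two update schemes on the linear model above, with one feature for brevity and synthetic data of my own (hypothetical, only to make the script runnable):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y_hat = 3.0 * x + 1.0 + rng.normal(0.0, 0.1, size=100)   # targets \hat{y}^n

eta = 0.1

# Gradient descent: one update per pass over ALL training examples
# (the gradient is averaged here instead of summed, which only rescales eta).
w, b = 0.0, 0.0
for epoch in range(100):
    err = y_hat - (b + w * x)            # residuals over the whole training set
    w += eta * 2.0 * np.mean(err * x)
    b += eta * 2.0 * np.mean(err)
print("batch GD:", w, b)

# Stochastic gradient descent: one update per single example x^n.
w, b = 0.0, 0.0
for epoch in range(10):
    for n in range(len(x)):
        err_n = y_hat[n] - (b + w * x[n])   # loss L^n uses only example n
        w += eta * 2.0 * err_n * x[n]
        b += eta * 2.0 * err_n
print("SGD     :", w, b)
```

With the same number of passes over the data, SGD performs many more (noisier) parameter updates, which is where the speed-up described above comes from.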

Tip 3: Feature Scaling

For example, suppose our model is: $y = b + w_1x_1 + w_2x_2$

Scaling the features so that they span comparable ranges ensures that the contribution of each feature to the output is not distorted simply because different inputs happen to have very different numeric ranges.

If the ranges of different parameters differ too much, the update direction given by the negative gradient (which is perpendicular to the loss contours) does not point toward the loss minimum. Feature scaling puts the parameters on the same range of values, making the contours closer to circles, so the updates head more directly toward the minimum.

The procedure is as follows: given $R$ examples $x^1, x^2, x^3, \dots, x^r, \dots, x^R$, standardize each dimension $i$:

$$x_i^r \leftarrow \frac{x_i^r - m_i}{\sigma_i}$$

where $m_i$ is the mean and $\sigma_i$ the standard deviation of dimension $i$ over all $R$ examples. After scaling, every dimension has mean 0 and variance 1.
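
A minimal NumPy sketch of this standardization; the two-feature data with very different scales is hypothetical:

```python
import numpy as np

# Hypothetical data: 1000 examples, 2 features with very different ranges.
X = np.random.default_rng(0).normal(loc=[5.0, 100.0], scale=[1.0, 30.0], size=(1000, 2))

m = X.mean(axis=0)         # m_i: mean of dimension i over all R examples
sigma = X.std(axis=0)      # sigma_i: standard deviation of dimension i
X_scaled = (X - m) / sigma

print(X_scaled.mean(axis=0))   # ~ [0, 0]
print(X_scaled.std(axis=0))    # ~ [1, 1]
```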

Theory

Next, let us discuss why gradient descent works at all.

The problem we want to solve is:

$$ \theta ^ { * } = \arg \min _ { \theta } L ( \theta ) \quad \text { by gradient descent } $$

Consider whether the following statement is correct:

Each time we update the parameters, we obtain a new $\theta$ that makes $L(\theta)$ smaller:

$$L \left( \theta ^ { 0 } \right) > L \left( \theta ^ { 1 } \right) > L \left( \theta ^ { 2 } \right) > \cdots$$
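
Whether this always holds is precisely what the derivation below examines. As a quick numerical check on a hypothetical 1-D loss $L(\theta) = \theta^2$: the inequality holds for a small learning rate, but breaks down when the learning rate is too large and the updates overshoot.

```python
def L(theta):
    return theta ** 2        # toy loss

def grad(theta):
    return 2.0 * theta

for eta in (0.1, 1.1):
    theta = 1.0
    losses = []
    for t in range(5):
        theta = theta - eta * grad(theta)
        losses.append(round(L(theta), 4))
    print(eta, losses)       # eta=0.1: strictly decreasing; eta=1.1: increasing
```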

Formal Derivation

Suppose $\theta$ has two variables $\{\theta_1, \theta_2\}$. Given a point, how can we easily find the point near it with the smallest value of $L$?

Taylor Series

Single-variable

Let $h(x)$ be any function infinitely differentiable around $x = x_0$.

$$ \begin{aligned} { h } ( { x } ) & = \sum _ { k = 0 } ^ { \infty } \frac { { h } ^ { ( k ) } \left( x _ { 0 } \right) } { k ! } \left( x - x _ { 0 } \right) ^ { k } \\ & = h \left( x _ { 0 } \right) + h ^ { \prime } \left( x _ { 0 } \right) \left( x - x _ { 0 } \right) + \frac { h ^ { \prime \prime } \left( x _ { 0 } \right) } { 2 ! } \left( x - x _ { 0 } \right) ^ { 2 } + \cdots \end{aligned} $$

When $x$ is close to $x_0$: $h(x) \approx h(x_0) + h'(x_0)(x - x_0)$
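
For instance (a standard textbook example, not from the slides), taking $h(x) = \sin x$ around $x_0 = \pi/4$:

$$ \sin x \approx \sin \frac { \pi } { 4 } + \cos \frac { \pi } { 4 } \left( x - \frac { \pi } { 4 } \right) = \frac { \sqrt { 2 } } { 2 } + \frac { \sqrt { 2 } } { 2 } \left( x - \frac { \pi } { 4 } \right) $$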

Multi-variable

Let $h(x, y)$ be any function infinitely differentiable around $(x_0, y_0)$.

$$ h ( x , y ) = h \left( x _ { 0 } , y _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial x } \left( x - x _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial y } \left( y - y _ { 0 } \right) + \cdots $$

When $x$ and $y$ are close to $x_0$ and $y_0$: $h ( x , y ) \approx h \left( x _ { 0 } , y _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial x } \left( x - x _ { 0 } \right) + \frac { \partial h \left( x _ { 0 } , y _ { 0 } \right) } { \partial y } \left( y - y _ { 0 } \right)$

Back to Formal Derivation


Based on the Taylor series: if the red circle is small enough (i.e., $\theta_1$ and $\theta_2$ are close to $a$ and $b$), then within the red circle

$$ \begin{aligned} { L } ( \theta ) & \approx { L } ( a , b ) + \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } \left( \theta _ { 1 } - a \right) + \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } } \left( \theta _ { 2 } - b \right) \\ & = s + u \left( \theta _ { 1 } - a \right) + v \left( \theta _ { 2 } - b \right) \end{aligned} $$

where,

  • $s = L(a, b)$
  • $u = \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } , v = \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } }$

Goal: Find $\theta_1$ and $\theta_2$ in the red circle minimizing $L(\theta)$

The interior of the red circle is described by the equation of a disk:

$$ \begin{aligned} \left( \theta _ { 1 } - a \right) ^ { 2 } + \left( \theta _ { 2 } - b \right) ^ { 2 } & \le d ^ { 2 } \\ (\Delta \theta_1)^2 + (\Delta \theta_2)^2 & \le d^2 \end{aligned} $$

To minimize $L(\theta)$: since $s$ is a constant, we minimize $u \Delta\theta_1 + v \Delta\theta_2$ over the circle. This inner product is most negative when $(\Delta\theta_1, \Delta\theta_2)$ points in the direction opposite to $(u, v)$ and reaches the boundary of the circle; writing the step as a multiple $\eta$ of the gradient:

$$ \begin{aligned} \left[ \begin{array} { l } { \Delta \theta _ { 1 } } \\ { \Delta \theta _ { 2 } } \end{array} \right] & = - \eta \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] \\ \Rightarrow \left[ \begin{array} { l } { \theta _ { 1 } } \\ { \theta _ { 2 } } \end{array} \right] & = \left[ \begin{array} { l } { a } \\ { b } \end{array} \right] - \eta \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] \\ & = \left[ \begin{array} { l } { a } \\ { b } \end{array} \right] - \eta \left[ \begin{array} { c } { \frac { \partial { L } ( a , b ) } { \partial \theta _ { 1 } } } \\ { \frac { \partial { L } ( a , b ) } { \partial \theta _ { 2 } } } \end{array} \right] \end{aligned} $$
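
A remark that makes the role of $\eta$ explicit: the minimum over the circle is attained on its boundary, opposite the gradient, so the step length is tied to the radius $d$:

$$ \left[ \begin{array} { l } { \Delta \theta _ { 1 } } \\ { \Delta \theta _ { 2 } } \end{array} \right] = - \frac { d } { \sqrt { u ^ { 2 } + v ^ { 2 } } } \left[ \begin{array} { l } { u } \\ { v } \end{array} \right] , \qquad \eta = \frac { d } { \sqrt { u ^ { 2 } + v ^ { 2 } } } $$

Since the Taylor approximation is only valid when the red circle is small, the gradient descent step is only guaranteed to decrease $L$ when $\eta$ (and hence $d$) is small enough, which answers the question raised at the start of the Theory section.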