# Lecture 5: Logistic Regression

## Three Steps

### Step 1: Function Set
We want to find $P_{w, b}(C_1 \mid x)$ and classify with the rule:
$$ \begin{cases} P_{w, b} (C_1 \mid x) \ge 0.5 & \text{output: } C_1 \\ P_{w, b} (C_1 \mid x) < 0.5 & \text{output: } C_2 \end{cases} $$
$$ \begin{aligned} P_{w, b} (C_1 \mid x) & = \sigma(z) \\ & = \sigma(w \cdot x + b) \end{aligned} $$
This gives the following function set, covering all possible values of $w$ and $b$:
$$f_{w, b}(x) = P_{w, b} (C_1 \mid x)$$
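As a quick illustration, here is a minimal NumPy sketch of one member of this function set; the names `sigmoid`, `f_wb`, and `classify` are ours, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """P_{w,b}(C1 | x) = sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

def classify(x, w, b):
    """Output C1 if P_{w,b}(C1 | x) >= 0.5, otherwise C2."""
    return "C1" if f_wb(x, w, b) >= 0.5 else "C2"
```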
### Step 2: Goodness of a Function
Suppose the training data is:
$$ \begin{array} { l l l l } { x ^ { 1 } } & { x ^ { 2 } } & { x ^ { 3 } } & \cdots & { x ^ { N } } \\ { C _ { 1 } } & { C _ { 1 } } & { C _ { 2 } } & \cdots & { C _ { 1 } } \end{array} $$
Next, as before, we need to measure how good a function is. Assume the data is generated from $f_{w, b}(x) = P_{w, b} (C_1 \mid x)$.
Given a set of $w$ and $b$, what is its probability of generating the data?
$$ L ( w , b ) = f _ { w , b } \left( x ^ { 1 } \right) f _ { w , b } \left( x ^ { 2 } \right) \left( 1 - f _ { w , b } \left( x ^ { 3 } \right) \right) \cdots f _ { w , b } \left( x ^ { N } \right) $$
The most likely $w^*$ and $b^*$ are the ones with the largest $L(w, b)$:
$$ w ^ { * } , b ^ { * } = \arg \max _ { w , b } L ( w , b ) $$
Maximizing $L(w, b)$ is equivalent to minimizing its negative log-likelihood:
$$ \begin{aligned} w ^ { * } , b ^ { * } & = \arg \min _ { w , b } - \ln L ( w , b ) \\ & = \arg \min _ { w , b } - \ln f _ { w , b } \left( x ^ { 1 } \right) - \ln f _ { w , b } \left( x ^ { 2 } \right) - \ln \left( 1 - f _ { w , b } \left( x ^ { 3 } \right) \right) \cdots - \ln f _ { w , b } \left( x ^ { N } \right) \\ & = \arg \min _ { w , b } \sum _ { n } - \left[ \hat { y } ^ { n } \ln f _ { w , b } \left( x ^ { n } \right) + \left( 1 - \hat { y } ^ { n } \right) \ln \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) \right] \end{aligned} $$
where

- $\hat y^n$: $1$ for $C_1$, $0$ for $C_2$
Each term in the sum $\sum_n$ is exactly the cross entropy between the following two distributions $p$ and $q$:

- Distribution $p$ (the label):

  $$ \begin{array} { l } { p ( x = 1 ) = \hat { y } ^ { n } } \\ { p ( x = 0 ) = 1 - \hat { y } ^ { n } } \end{array} $$

- Distribution $q$ (the model's prediction):

  $$ \begin{array} { l } { q ( x = 1 ) = f \left( x ^ { n } \right) } \\ { q ( x = 0 ) = 1 - f \left( x ^ { n } \right) } \end{array} $$
$$ H ( p , q ) = - \sum _ { x } p ( x ) \ln ( q ( x ) ) $$
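A minimal NumPy sketch of this loss computed over the whole training set; the clipping by `eps` is our addition to avoid $\ln(0)$:

```python
import numpy as np

def cross_entropy_loss(y_hat, f):
    """-ln L(w, b) = sum_n -[y_hat ln f + (1 - y_hat) ln(1 - f)].

    y_hat: labels (1 for C1, 0 for C2); f: model outputs f_{w,b}(x^n).
    """
    eps = 1e-12                       # guard against ln(0)
    f = np.clip(f, eps, 1.0 - eps)
    return -np.sum(y_hat * np.log(f) + (1.0 - y_hat) * np.log(1.0 - f))
```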
Question: why not use square error as the loss function for logistic regression, as in linear regression?

Answer: after differentiation, some factors in the gradient become $0$ when the output saturates, so the parameters update too slowly. This is worked out in *Why not Logistic Regression with Square Error?* below.
### Step 3: Find the Best Function
Having chosen the loss function, we now find the best function in the set. First, take the partial derivative with respect to $w_i$:
$$ \frac { \partial \left( - \ln L ( w , b ) \right) } { \partial w _ { i } } = \sum _ { n } - \left[ \hat { y } ^ { n } \frac { \partial \ln f _ { w , b } \left( x ^ { n } \right) } { \partial w _ { i } } + \left( 1 - \hat { y } ^ { n } \right) \frac { \partial \ln \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) } { \partial w _ { i } } \right] \tag{*} $$
where
- $f _ { w , b } ( x ) = \sigma ( z ) = 1 / (1 + e^{-z})$
- $z = w \cdot x + b = \sum _ { i } w _ { i } x _ { i } + b$
$$ \begin{aligned} \frac { \partial \ln f _ { w , b } ( x ) } { \partial w _ { i } } & = \frac { \partial \ln f _ { w , b } ( x ) } { \partial z } \frac { \partial z } { \partial w _ { i } } \\ & = \frac { \partial \ln \sigma ( z ) } { \partial z } \cdot x_i \\ & = \frac { 1 } { \sigma ( z ) } \frac { \partial \sigma ( z ) } { \partial z } \cdot x_i \\ & = \frac{1}{\sigma(z)} \sigma(z) (1 - \sigma(z)) \cdot x_i \\ & = (1 - \sigma(z)) \cdot x_i \\ & = \left( 1 - f _ { w , b } \left( x \right) \right) \cdot x_i \end{aligned} $$
$$ \begin{aligned} \frac { \partial \ln \left( 1 - f _ { w , b } ( x ) \right) } { \partial w _ { i } } & = \frac { \partial \ln \left( 1 - f _ { w , b } ( x ) \right) } { \partial z } \frac { \partial z } { \partial w _ { i } } \\ & = \frac { \partial \ln ( 1 - \sigma ( z ) ) } { \partial z } \cdot x_i \\ & = - \frac { 1 } { 1 - \sigma ( z ) } \frac { \partial \sigma ( z ) } { \partial z } \cdot x_i \\ & = - \frac{1}{1 - \sigma(z)} \sigma(z) (1 - \sigma(z)) \cdot x_i \\ & = -\sigma(z) \cdot x_i \\ & = -f _ { w , b } \left( x \right) \cdot x_i \end{aligned} $$
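As a sanity check, both derivatives can be verified symbolically; here is a small SymPy sketch (our own, not part of the lecture):

```python
import sympy as sp

z = sp.symbols("z")
sigma = 1 / (1 + sp.exp(-z))

# d/dz ln(sigma(z)) should equal 1 - sigma(z)
check1 = sp.simplify(sp.diff(sp.log(sigma), z) - (1 - sigma))
# d/dz ln(1 - sigma(z)) should equal -sigma(z)
check2 = sp.simplify(sp.diff(sp.log(1 - sigma), z) + sigma)

print(check1, check2)  # both print 0
```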
Substituting these results into equation $(*)$:
$$ \begin{aligned} \frac { \partial \left( - \ln L ( w , b ) \right) } { \partial w _ { i } } & = \sum _ { n } - \left[ \hat { y } ^ { n } \frac { \partial \ln f _ { w , b } \left( x ^ { n } \right) } { \partial w _ { i } } + \left( 1 - \hat { y } ^ { n } \right) \frac { \partial \ln \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) } { \partial w _ { i } } \right] \\ & = \sum _ { n } - \left[ \hat { y } ^ { n } \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) x _ { i } ^ { n } - \left( 1 - \hat { y } ^ { n } \right) f _ { w , b } \left( x ^ { n } \right) x _ { i } ^ { n } \right] \\ & = \sum _ { n } - \left[ \hat { y } ^ { n } - \hat { y } ^ { n } f _ { w , b } \left( x ^ { n } \right) - f _ { w , b } \left( x ^ { n } \right) + \hat { y } ^ { n } f _ { w , b } \left( x ^ { n } \right) \right] x _ { i } ^ { n } \\ & = \sum _ { n } - \left( \hat { y } ^ { n } - f _ { w , b } \left( x ^ { n } \right) \right) x _ { i } ^ { n } \end{aligned} $$
The parameter update rule is then:
$$ w _ { i } \leftarrow w _ { i } - \eta \sum _ { n } - \left( \hat { y } ^ { n } - f _ { w , b } \left( x ^ { n } \right) \right) x _ { i } ^ { n } $$
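Putting the three steps together, here is a minimal gradient-descent loop implementing this update rule, a sketch with made-up toy data; the bias update reuses the same derivation with $x_i = 1$:

```python
import numpy as np

def train(X, y_hat, eta=0.1, epochs=1000):
    """Gradient descent on the cross-entropy loss of logistic regression."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # f_{w,b}(x^n) for all n
        error = y_hat - f                               # (y_hat^n - f_{w,b}(x^n))
        w -= eta * np.sum(-error[:, None] * X, axis=0)  # the update rule above
        b -= eta * np.sum(-error)                       # same rule with x_i = 1
    return w, b

# Toy usage on two made-up clusters:
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.9, 1.0], [1.0, 0.8]])
y_hat = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train(X, y_hat)
```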
## Logistic vs. Linear

The table below compares logistic regression with linear regression:
| | Logistic Regression | Linear Regression |
| --- | --- | --- |
| $f_{w, b}(x)$ | $\sigma (\sum_i w_ix_i + b)$ | $\sum_i w_ix_i + b$ |
| Output | between $0$ and $1$ | any value |
| Training data | $(x^n, \hat y^n)$ | $(x^n, \hat y^n)$ |
| $\hat y^n$ | $1$ for class 1, $0$ for class 2 | a real number |
| $L(f)$ | $\sum_n C(f(x^n), \hat y^n) = \sum_n - \left[ \hat { y } ^ { n } \ln f \left( x ^ { n } \right) + \left( 1 - \hat { y } ^ { n } \right) \ln \left( 1 - f \left( x ^ { n } \right) \right) \right]$ | $\frac 1 2 \sum_n (f(x^n) - \hat y^n)^2$ |
| Update method | $w_i \leftarrow w_i - \eta \sum_n - (\hat y^n - f_{w, b}(x^n)) x_i^n$ | the same rule, with $f_{w, b}(x) = \sum_i w_i x_i + b$ |
## Why not Logistic Regression with Square Error?

If we instead used the square-error loss from linear regression:
$$ L ( f ) = \frac { 1 } { 2 } \sum _ { n } \left( f _ { w , b } \left( x ^ { n } \right) - \hat { y } ^ { n } \right) ^ { 2 } $$
then differentiating with respect to $w_i$ as in Step 3 gives:
$$ \begin{aligned} \frac { \partial \left( f _ { w , b } ( x ) - \hat { y } \right) ^ { 2 } } { \partial w _ { i } } & = 2 \left( f _ { w , b } ( x ) - \hat { y } \right) \frac { \partial f _ { w , b } ( x ) } { \partial z } \frac { \partial z } { \partial w _ { i } } \\ & = 2 \left( f _ { w , b } ( x ) - \hat { y } \right) f _ { w , b } ( x ) \left( 1 - f _ { w , b } ( x ) \right) x _ { i } \end{aligned} $$
Regardless of whether $\hat y^n = 1$ or $\hat y^n = 0$, the factor $f_{w, b}(x)\left(1 - f_{w, b}(x)\right)$ drives $\partial L / \partial w_i$ to $0$ whenever $f_{w, b}(x)$ saturates at $0$ or $1$, even when the prediction is completely wrong (e.g. $\hat y^n = 1$ but $f_{w, b}(x^n) \approx 0$), so the parameters cannot be updated effectively.
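A small numeric illustration of this effect (our own example): with $\hat y = 1$ but $z = -10$, the square-error gradient is nearly zero while the cross-entropy gradient is still large:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, x_i, y_hat = -10.0, 1.0, 1.0        # y_hat = 1 but z << 0: very wrong
f = sigmoid(z)                         # ~ 4.5e-05

grad_se = 2 * (f - y_hat) * f * (1 - f) * x_i  # square error: ~ -9e-05, stuck
grad_ce = -(y_hat - f) * x_i                   # cross entropy: ~ -1, large step
print(grad_se, grad_ce)
```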
## Generative vs. Discriminative
- Benefits of a generative model:
  - With the assumption of a probability distribution, less training data is needed.
  - With the assumption of a probability distribution, it is more robust to noise.
  - Priors and class-dependent probabilities can be estimated from different sources.
## Multi-class Classification

We use 3 classes as an example:
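The text above does not spell out the multi-class model; a standard setup (sketched here under that assumption) computes one linear score $z_k = w^k \cdot x + b^k$ per class and normalizes the scores with softmax:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

def predict(x, W, b):
    """W: 3 x d weights (one row per class), b: length-3 biases.

    Returns the vector of probabilities P(C_k | x).
    """
    return softmax(W @ x + b)
```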
## Limitation of Logistic Regression

Given the four points above, logistic regression cannot classify them effectively: no linear boundary can perfectly separate the red points from the blue points. There are two ways to work around this:
- Feature transformation: not always easy to find a good transformation (see the sketch after this list)
- Cascading logistic regression models
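As an illustration of the first option, one transformation that works for XOR-like points is mapping each point to its distances from $(0, 0)$ and $(1, 1)$; the specific transform here is our choice:

```python
import numpy as np

# XOR-like points: {(0,0), (1,1)} in one class, {(0,1), (1,0)} in the other.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def transform(x):
    d1 = np.linalg.norm(x - np.array([0.0, 0.0]))  # distance to (0, 0)
    d2 = np.linalg.norm(x - np.array([1.0, 1.0]))  # distance to (1, 1)
    return np.array([d1, d2])

X_new = np.array([transform(x) for x in X])
print(X_new)
# In the new feature space, (0,1) and (1,0) map to the same point,
# and the two classes become linearly separable.
```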