
Lecture 5: Logistic Regression

Three Steps

Step 1: Function Set

What we want to find is $P_{w, b} (C_1 \mid x)$, which gives the classifier:

$$
f_{w, b} = \begin{cases} P_{w, b} (C_1 \mid x) \ge 0.5 & \text{output: } C_1 \\ \text{else} & \text{output: } C_2 \end{cases}
$$

$$
\begin{aligned} P_{w, b} (C_1 \mid x) & = \sigma(z) \\ & = \sigma(w \cdot x + b) \end{aligned}
$$

We then have the following function set (containing all possible $w$ and $b$):

$$
f_{w, b}(x) = P_{w, b} (C_1 \mid x)
$$
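To make Step 1 concrete, here is a minimal NumPy sketch of one member of the function set (the helper names `sigmoid`, `f`, and `classify` are ours, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """The logistic function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x, w, b):
    """One member of the function set: P_{w,b}(C1 | x) = sigma(w . x + b)."""
    return sigmoid(np.dot(w, x) + b)

def classify(x, w, b):
    """Output C1 when P_{w,b}(C1 | x) >= 0.5, otherwise C2."""
    return "C1" if f(x, w, b) >= 0.5 else "C2"
```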

Step 2: Goodness of a Function

Suppose the training data is:

$$
\begin{array}{lllll} x^1 & x^2 & x^3 & \cdots & x^N \\ C_1 & C_1 & C_2 & \cdots & C_1 \end{array}
$$

Next, as before, we need a way to evaluate how good a function is. Assume the data is generated from $f_{w, b}(x) = P_{w, b} (C_1 \mid x)$.

Given a set of $w$ and $b$, what is its probability of generating the data?

$$
L(w, b) = f_{w, b} \left( x^1 \right) f_{w, b} \left( x^2 \right) \left( 1 - f_{w, b} \left( x^3 \right) \right) \cdots f_{w, b} \left( x^N \right)
$$

The most likely $w^*$ and $b^*$ are the ones with the largest $L(w, b)$:

$$
w^*, b^* = \arg \max_{w, b} L(w, b)
$$

Solving the above is equivalent to solving:

$$
\begin{aligned} w^*, b^* & = \arg \min_{w, b} -\ln L(w, b) \\ & = \arg \min_{w, b} -\ln f_{w, b} \left( x^1 \right) - \ln f_{w, b} \left( x^2 \right) - \ln \left( 1 - f_{w, b} \left( x^3 \right) \right) \cdots - \ln f_{w, b} \left( x^N \right) \\ & = \arg \min_{w, b} \sum_n -\left[ \hat{y}^n \ln f_{w, b} \left( x^n \right) + \left( 1 - \hat{y}^n \right) \ln \left( 1 - f_{w, b} \left( x^n \right) \right) \right] \end{aligned}
$$

where

  • $\hat{y}^n$: $1$ for $C_1$, $0$ for $C_2$
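As a sanity check on this derivation, here is a minimal NumPy sketch of the loss (the function name `nll` and the `eps` guard against $\ln 0$ are our additions):

```python
import numpy as np

def nll(w, b, X, y_hat):
    """Negative log-likelihood -ln L(w, b), i.e. the cross-entropy loss.

    X: (N, d) feature matrix; y_hat: (N,) labels, 1 for C1 and 0 for C2.
    """
    f = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # f_{w,b}(x^n) for every example n
    eps = 1e-12                             # keep log() away from zero
    return -np.sum(y_hat * np.log(f + eps) + (1 - y_hat) * np.log(1 - f + eps))
```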

Each term inside $\sum_n$ is exactly the cross entropy between the following two distributions $p$ and $q$:

  • Distribution $p$:

    $p(x = 1) = \hat{y}^n$ and $p(x = 0) = 1 - \hat{y}^n$

  • Distribution $q$:

    $q(x = 1) = f(x^n)$ and $q(x = 0) = 1 - f(x^n)$

$$
H(p, q) = -\sum_x p(x) \ln(q(x))
$$
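Substituting $p$ and $q$ into this definition recovers exactly the summand above:

$$
H(p, q) = -\left[ \hat{y}^n \ln f(x^n) + \left( 1 - \hat{y}^n \right) \ln \left( 1 - f(x^n) \right) \right]
$$

so minimizing $-\ln L(w, b)$ is the same as minimizing the total cross entropy between the labels and the model's outputs.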

Question: why not use square error as the loss function for logistic regression?

Answer: after differentiation, certain factors can be $0$ even when the prediction is far from the target, so the parameters update too slowly (see the Square Error section below for the derivation).

Step 3: Find the best function

Having decided on the loss function, we now need to find the best function in the set. First, take the partial derivative with respect to $w_i$:

$$
\frac{\partial (-\ln L(w, b))}{\partial w_i} = \sum_n -\left[ \hat{y}^n \frac{\partial \ln f_{w, b} \left( x^n \right)}{\partial w_i} + \left( 1 - \hat{y}^n \right) \frac{\partial \ln \left( 1 - f_{w, b} \left( x^n \right) \right)}{\partial w_i} \right] \tag{*}
$$

where

  • $f_{w, b}(x) = \sigma(z) = 1 / (1 + e^{-z})$
  • $z = w \cdot x + b = \sum_i w_i x_i + b$

$$
\begin{aligned} \frac{\partial \ln f_{w, b}(x)}{\partial w_i} & = \frac{\partial \ln f_{w, b}(x)}{\partial z} \frac{\partial z}{\partial w_i} \\ & = \frac{\partial \ln \sigma(z)}{\partial z} \cdot x_i \\ & = \frac{1}{\sigma(z)} \frac{\partial \sigma(z)}{\partial z} \cdot x_i \\ & = \frac{1}{\sigma(z)} \sigma(z) (1 - \sigma(z)) \cdot x_i \\ & = (1 - \sigma(z)) \cdot x_i \\ & = \left( 1 - f_{w, b}(x) \right) \cdot x_i \end{aligned}
$$

$$
\begin{aligned} \frac{\partial \ln \left( 1 - f_{w, b}(x) \right)}{\partial w_i} & = \frac{\partial \ln \left( 1 - f_{w, b}(x) \right)}{\partial z} \frac{\partial z}{\partial w_i} \\ & = \frac{\partial \ln (1 - \sigma(z))}{\partial z} \cdot x_i \\ & = -\frac{1}{1 - \sigma(z)} \frac{\partial \sigma(z)}{\partial z} \cdot x_i \\ & = -\frac{1}{1 - \sigma(z)} \sigma(z) (1 - \sigma(z)) \cdot x_i \\ & = -\sigma(z) \cdot x_i \\ & = -f_{w, b}(x) \cdot x_i \end{aligned}
$$

Using the results above, we can substitute into equation (*):

$$
\begin{aligned} \frac{\partial (-\ln L(w, b))}{\partial w_i} & = \sum_n -\left[ \hat{y}^n \frac{\partial \ln f_{w, b} \left( x^n \right)}{\partial w_i} + \left( 1 - \hat{y}^n \right) \frac{\partial \ln \left( 1 - f_{w, b} \left( x^n \right) \right)}{\partial w_i} \right] \\ & = \sum_n -\left[ \hat{y}^n \left( 1 - f_{w, b} \left( x^n \right) \right) x_i^n - \left( 1 - \hat{y}^n \right) f_{w, b} \left( x^n \right) x_i^n \right] \\ & = \sum_n -\left[ \hat{y}^n - \hat{y}^n f_{w, b} \left( x^n \right) - f_{w, b} \left( x^n \right) + \hat{y}^n f_{w, b} \left( x^n \right) \right] x_i^n \\ & = \sum_n -\left( \hat{y}^n - f_{w, b} \left( x^n \right) \right) x_i^n \end{aligned}
$$

The parameters are then updated as follows:

$$
w_i \leftarrow w_i - \eta \sum_n -\left( \hat{y}^n - f_{w, b} \left( x^n \right) \right) x_i^n
$$
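Putting Step 3 together, here is a minimal sketch of the resulting gradient-descent loop (the function name `train_logistic` and the learning-rate/epoch defaults are our assumptions; the update itself is a direct transcription of the rule above):

```python
import numpy as np

def train_logistic(X, y_hat, lr=0.1, epochs=1000):
    """Batch gradient descent implementing
    w_i <- w_i - eta * sum_n -(y_hat^n - f_{w,b}(x^n)) * x_i^n
    and the analogous update for b (whose "feature" is the constant 1).

    X: (N, d) feature matrix; y_hat: (N,) labels in {0, 1}.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # f_{w,b}(x^n) for every n
        error = y_hat - f                       # (y_hat^n - f_{w,b}(x^n))
        w -= lr * -(X.T @ error)                # one step for all w_i at once
        b -= lr * -np.sum(error)
    return w, b
```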

Logistic vs. Linear

The table below compares Logistic Regression and Linear Regression:

| | Logistic Regression | Linear Regression |
| --- | --- | --- |
| $f_{w, b}(x)$ | $\sigma(\sum_i w_i x_i + b)$ | $\sum_i w_i x_i + b$ |
| Output | between $0$ and $1$ | any value |
| Training data | $(x^n, \hat{y}^n)$ | $(x^n, \hat{y}^n)$ |
| $\hat{y}^n$ | $1$ for class 1, $0$ for class 2 | a real number |
| $L(f)$ | $\sum_n C(f(x^n), \hat{y}^n) = \sum_n -\left[ \hat{y}^n \ln f(x^n) + (1 - \hat{y}^n) \ln(1 - f(x^n)) \right]$ | $\frac{1}{2} \sum_n (f(x^n) - \hat{y}^n)^2$ |
| Update | $w_i \leftarrow w_i - \eta \sum_n -(\hat{y}^n - f_{w, b}(x^n)) x_i^n$ | the identical rule |

Why not Logistic Regression with Square Error?

If we instead wrote the loss function as square error:

$$
L(f) = \frac{1}{2} \sum_n \left( f_{w, b} \left( x^n \right) - \hat{y}^n \right)^2
$$

Then, differentiating with respect to $w_i$ as in Step 3 (Find the best function):

$$
\begin{aligned} \frac{\partial \left( f_{w, b}(x) - \hat{y} \right)^2}{\partial w_i} & = 2 \left( f_{w, b}(x) - \hat{y} \right) \frac{\partial f_{w, b}(x)}{\partial z} \frac{\partial z}{\partial w_i} \\ & = 2 \left( f_{w, b}(x) - \hat{y} \right) f_{w, b}(x) \left( 1 - f_{w, b}(x) \right) x_i \end{aligned}
$$

Whether $\hat{y}^n = 1$ or $\hat{y}^n = 0$, this gradient can be $0$ even when the prediction is completely wrong: for example, with $\hat{y}^n = 1$ and $f_{w, b}(x^n) \approx 0$, the factor $f_{w, b}(x^n)$ makes the whole gradient vanish, so the parameters cannot be updated effectively.
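A quick numerical check of this failure mode (the values are illustrative; $x_i$ is set to $1$ for simplicity):

```python
import numpy as np

# A saturated, badly wrong prediction: target is 1 but z is very negative.
y_hat, z, x_i = 1.0, -10.0, 1.0
f = 1.0 / (1.0 + np.exp(-z))                       # f ~ 4.5e-5

grad_square = 2 * (f - y_hat) * f * (1 - f) * x_i  # square-error gradient
grad_xent = -(y_hat - f) * x_i                     # cross-entropy gradient

print(grad_square)  # ~ -9.1e-05: nearly zero, so training stalls
print(grad_xent)    # ~ -1.0: large, so training makes progress
```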

Generative vs. Discriminative

  • Benefits of the generative model
    • With the assumption of a probability distribution, less training data is needed
    • With the assumption of a probability distribution, it is more robust to noise
    • Priors and class-dependent probabilities can be estimated from different sources

Multi-class Classification

We use 3 classes as an example:
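The notes illustrate this with a figure; as a stand-in, here is a minimal sketch of the usual multi-class setup, assuming the standard softmax formulation with one linear score per class (the names `softmax`, `predict`, `W`, and `b` are our own):

```python
import numpy as np

def softmax(z):
    """Turn the class scores z_k into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / np.sum(e)

def predict(x, W, b):
    """W: (3, d) weights, one row per class; b: (3,) biases.

    Each class gets a linear score z_k = w_k . x + b_k, and softmax
    converts the scores into P(C_k | x).
    """
    probs = softmax(W @ x + b)
    return np.argmax(probs), probs
```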

Limitation of Logistic Regression

Given the 4 points above, logistic regression cannot classify them effectively: no linear boundary can perfectly separate the red points from the blue points. There are two ways around this:

  • Feature Transformation: it is not always easy to find a good transformation by hand (see the sketch after this list)
  • Cascading logistic regression models
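As an example of feature transformation, here is a minimal sketch of the classic XOR-style case; we are assuming the 4 points in the figure are $(0, 0)$ and $(1, 1)$ in class 1 and $(0, 1)$ and $(1, 0)$ in class 2. Mapping each point to its distances from $(0, 0)$ and $(1, 1)$ makes the two classes linearly separable:

```python
import numpy as np

# Assumed XOR-style data: class 1 = {(0,0), (1,1)}, class 2 = {(0,1), (1,0)}.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, 0, 0])

def transform(x):
    """Map x to (distance to (0,0), distance to (1,1))."""
    return np.array([np.linalg.norm(x - np.array([0.0, 0.0])),
                     np.linalg.norm(x - np.array([1.0, 1.0]))])

X_new = np.array([transform(x) for x in X])
# Class 1 maps to (0, 1.41) and (1.41, 0); both class 2 points map to (1, 1),
# so a single line (e.g. x1' + x2' = 1.7) now separates the classes.
print(X_new)
```

Cascading logistic regression models take the other route: earlier models learn the transformation, and a final model does the classification.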