
Lecture 5: Logistic Regression

Three Steps

Step 1: Function Set

What we want to find is $P_{w, b} (C_1 \mid x)$:

$$ f_{w, b}(x) = \begin{cases} C_1 & \text{if } P_{w, b} (C_1 \mid x) \ge 0.5 \\ C_2 & \text{otherwise} \end{cases} $$

$$ \begin{aligned} P_{w, b} (C_1 \mid x) & = \sigma(z) \\ & = \sigma(w \cdot x + b) \end{aligned} $$

This gives us the following function set (covering all possible values of $w$ and $b$):

$$f_{w, b}(x) = P_{w, b} (C_1 \mid x)$$
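
As a concrete sketch, one member of this function set in NumPy (the helper names `sigmoid`, `f_wb`, and `predict` are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """P_{w,b}(C1 | x) for a single feature vector x."""
    return sigmoid(np.dot(w, x) + b)

def predict(x, w, b):
    """Decide the class by thresholding the posterior at 0.5."""
    return "C1" if f_wb(x, w, b) >= 0.5 else "C2"
```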

Step 2: Goodness of a Function

Suppose the training data is:

$$ \begin{array} { c c c c c } x ^ { 1 } & x ^ { 2 } & x ^ { 3 } & \cdots & x ^ { N } \\ C _ { 1 } & C _ { 1 } & C _ { 2 } & \cdots & C _ { 1 } \end{array} $$

Next, as in linear regression, we need to evaluate how good a function is. Assume the training data is generated from $f_{w, b}(x) = P_{w, b} (C_1 \mid x)$.

Given a set of $w$ and $b$, what is its probability of generating the data?

$$ L ( w , b ) = f _ { w , b } \left( x ^ { 1 } \right) f _ { w , b } \left( x ^ { 2 } \right) \left( 1 - f _ { w , b } \left( x ^ { 3 } \right) \right) \cdots f _ { w , b } \left( x ^ { N } \right) $$

The most likely $w^*$ and $b^*$ are the ones that maximize $L(w, b)$:

$$ w ^ { * } , b ^ { * } = \arg \max _ { w , b } L ( w , b ) $$

Maximizing the above is equivalent to minimizing the negative log-likelihood:

$$ \begin{aligned} w ^ { * } , b ^ { * } & = \arg \min _ { w , b } - \ln L ( w , b ) \\ & = \arg \min _ { w , b } - \ln f _ { w , b } \left( x ^ { 1 } \right) - \ln f _ { w , b } \left( x ^ { 2 } \right) - \ln \left( 1 - f _ { w , b } \left( x ^ { 3 } \right) \right) \cdots - \ln f _ { w , b } \left( x ^ { N } \right) \\ & = \arg \min _ { w , b } \sum _ { n } - \left[ \hat { y } ^ { n } \ln f _ { w , b } \left( x ^ { n } \right) + \left( 1 - \hat { y } ^ { n } \right) \ln \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) \right] \end{aligned} $$

where

  • $\hat y^n$:$1$ for $C_1$, $0$ for $C_2$

The $\sum_n$ term is exactly the cross entropy between the following two distributions $p$ and $q$:

  • Distribution $p$:

    $$ \begin{array} { l } { p ( x = 1 ) = \hat { y } ^ { n } } \\ { p ( x = 0 ) = 1 - \hat { y } ^ { n } } \end{array} $$

  • Distribution $q$:

    $$ \begin{array} { l } {{ q } ( x = 1 ) = f \left( x ^ { n } \right) } \\ {{ q } ( x = 0 ) = 1 - f \left( x ^ { n } \right) } \end{array} $$

$$ H ( p , q ) = - \sum _ { x } p ( x ) \ln ( q ( x ) ) $$
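
Expanding $H(p, q)$ over the two outcomes $x \in \{0, 1\}$ recovers exactly one term of the sum above:

$$ \begin{aligned} H(p, q) & = - p(x=1) \ln q(x=1) - p(x=0) \ln q(x=0) \\ & = - \left[ \hat y^n \ln f \left( x^n \right) + \left( 1 - \hat y^n \right) \ln \left( 1 - f \left( x^n \right) \right) \right] \end{aligned} $$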

Question: why don't we use square error as the loss function for logistic regression?

Answer: after differentiating, certain terms become $0$ even when the prediction is far from the target, so the parameters update too slowly (see Why not Logistic Regression with Square Error? below).

Step 3: Find the best function

Having chosen the loss function, we now search the function set for the best function. First, take the partial derivative with respect to $w_i$:

$$ \frac { \partial \left( - \ln L ( w , b ) \right) } { \partial w _ { i } } = \sum _ { n } - \left[ \hat { y } ^ { n } \frac { \partial \ln f _ { w , b } \left( x ^ { n } \right) } { \partial w _ { i } } + \left( 1 - \hat { y } ^ { n } \right) \frac { \partial \ln \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) } { \partial w _ { i } } \right] \tag{*} $$

where

  • $f _ { w , b } ( x ) = \sigma ( z ) = 1 / (1 + e^{-z})$
  • $z = w \cdot x + b = \sum _ { i } w _ { i } x _ { i } + b$
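
The two derivations below use the derivative of the sigmoid, which follows directly from its definition:

$$ \frac{\partial \sigma(z)}{\partial z} = \frac{e^{-z}}{\left( 1 + e^{-z} \right)^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z) \left( 1 - \sigma(z) \right) $$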

$$ \begin{aligned} \frac { \partial \ln f _ { w , b } ( x ) } { \partial w _ { i } } & = \frac { \partial \ln f _ { w , b } ( x ) } { \partial z } \frac { \partial z } { \partial w _ { i } } \\ & = \frac { \partial \ln \sigma ( z ) } { \partial z } \cdot x_i \\ & = \frac { 1 } { \sigma ( z ) } \frac { \partial \sigma ( z ) } { \partial z } \cdot x_i \\ & = \frac{1}{\sigma(z)} \sigma(z) (1 - \sigma(z)) \cdot x_i \\ & = (1 - \sigma(z)) \cdot x_i \\ & = \left( 1 - f _ { w , b } \left( x \right) \right) \cdot x_i \end{aligned} $$

$$ \begin{aligned} \frac { \partial \ln \left( 1 - f _ { w , b } ( x ) \right) } { \partial w _ { i } } & = \frac { \partial \ln \left( 1 - f _ { w , b } ( x ) \right) } { \partial z } \frac { \partial z } { \partial w _ { i } } \\ & = \frac { \partial \ln ( 1 - \sigma ( z ) ) } { \partial z } \cdot x_i \\ & = - \frac { 1 } { 1 - \sigma ( z ) } \frac { \partial \sigma ( z ) } { \partial z } \cdot x_i \\ & = - \frac{1}{1 - \sigma(z)} \sigma(z) (1 - \sigma(z)) \cdot x_i \\ & = -\sigma(z) \cdot x_i \\ & = -f _ { w , b } \left( x \right) \cdot x_i \end{aligned} $$

Substituting the results above into equation $(*)$:

$$ \begin{aligned} \frac { \partial \left( - \ln L ( w , b ) \right) } { \partial w _ { i } } & = \sum _ { n } - \left[ \hat { y } ^ { n } \frac { \partial \ln f _ { w , b } \left( x ^ { n } \right) } { \partial w _ { i } } + \left( 1 - \hat { y } ^ { n } \right) \frac { \partial \ln \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) } { \partial w _ { i } } \right] \\ & = \sum _ { n } - \left[ \hat { y } ^ { n } \left( 1 - f _ { w , b } \left( x ^ { n } \right) \right) x _ { i } ^ { n } - \left( 1 - \hat { y } ^ { n } \right) f _ { w , b } \left( x ^ { n } \right) x _ { i } ^ { n } \right] \\ & = \sum _ { n } - \left[ \hat { y } ^ { n } - \hat { y } ^ { n } f _ { w , b } \left( x ^ { n } \right) - f _ { w , b } \left( x ^ { n } \right) + \hat { y } ^ { n } f _ { w , b } \left( x ^ { n } \right) \right] x _ { i } ^ { n } \\ & = \sum _ { n } - \left( \hat { y } ^ { n } - f _ { w , b } \left( x ^ { n } \right) \right) x _ { i } ^ { n } \end{aligned} $$

The parameters are then updated as follows:

$$ w _ { i } \leftarrow w _ { i } - \eta \sum _ { n } - \left( \hat { y } ^ { n } - f _ { w , b } \left( x ^ { n } \right) \right) x _ { i } ^ { n } $$
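
A minimal NumPy sketch of this training loop (the learning rate `eta` and iteration count are illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y_hat, eta=0.1, n_iters=1000):
    """Gradient descent on the cross-entropy loss.

    X:     (N, d) matrix of examples x^n
    y_hat: (N,) labels, 1 for C1 and 0 for C2
    """
    N, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        f = sigmoid(X @ w + b)       # f_{w,b}(x^n) for all n at once
        error = y_hat - f            # (y_hat^n - f_{w,b}(x^n))
        w -= eta * -(X.T @ error)    # w_i <- w_i - eta * sum_n -(error^n) x_i^n
        b -= eta * -np.sum(error)    # same rule with x_i^n = 1
    return w, b
```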

Logistic vs. Linear

Below we compare Logistic Regression with Linear Regression:

| | Logistic Regression | Linear Regression |
| --- | --- | --- |
| $f_{w, b}(x)$ | $\sigma \left( \sum_i w_i x_i + b \right)$ | $\sum_i w_i x_i + b$ |
| Output | between $0$ and $1$ | any value |
| Training data | $(x^n, \hat y^n)$ | $(x^n, \hat y^n)$ |
| $\hat y^n$ | $1$ for class 1, $0$ for class 2 | a real number |
| $L(f)$ | $\sum_n C(f(x^n), \hat y^n) = \sum_n - \left[ \hat { y } ^ { n } \ln f \left( x ^ { n } \right) + \left( 1 - \hat { y } ^ { n } \right) \ln \left( 1 - f \left( x ^ { n } \right) \right) \right]$ | $\frac 1 2 \sum_n (f(x^n) - \hat y^n)^2$ |
| Update method | $w_i \leftarrow w_i - \eta \sum_n - (\hat y^n - f_{w, b}(x^n)) x_i^n$ | identical in form |

Why not Logistic Regression with Square Error?

If we instead write the loss function as a square-error version:

$$ L ( f ) = \frac { 1 } { 2 } \sum _ { n } \left( f _ { w , b } \left( x ^ { n } \right) - \hat { y } ^ { n } \right) ^ { 2 } $$

then differentiating with respect to $w_i$ in Step 3: Find the best function gives:

$$ \begin{aligned} \frac { \partial \left( f _ { w , b } ( x ) - \hat { y } \right) ^ { 2 } } { \partial w _ { i } } & = 2 \left( f _ { w , b } ( x ) - \hat { y } \right) \frac { \partial f _ { w , b } ( x ) } { \partial z } \frac { \partial z } { \partial w _ { i } } \\ & = 2 \left( f _ { w , b } ( x ) - \hat { y } \right) f _ { w , b } ( x ) \left( 1 - f _ { w , b } ( x ) \right) x _ { i } \end{aligned} $$

Whether $\hat y^n = 1$ or $\hat y^n = 0$, the factor $f_{w, b}(x) (1 - f_{w, b}(x))$ drives $\partial L / \partial w_i$ to $0$ whenever $f_{w, b}(x)$ is close to $0$ or $1$, even when the prediction is far from the target (e.g. $\hat y = 1$ but $f_{w, b}(x) \approx 0$), so the parameters cannot be updated effectively.
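
A quick numerical check of this point, comparing the square-error gradient with the cross-entropy gradient at a prediction far from the target (the concrete values are illustrative):

```python
f = 1e-6      # f_{w,b}(x) ~ 0: prediction is far from the target
y_hat = 1.0   # true class is C1
x_i = 1.0

# Square-error gradient: 2 (f - y_hat) f (1 - f) x_i -> vanishes as f -> 0
grad_se = 2 * (f - y_hat) * f * (1 - f) * x_i

# Cross-entropy gradient: -(y_hat - f) x_i -> stays large when far off
grad_ce = -(y_hat - f) * x_i

print(grad_se)  # ~ -2e-06: almost no update despite a wrong prediction
print(grad_ce)  # ~ -1.0:   a large, useful update
```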

Generative vs. Discriminative

  • Benefits of a generative model
    • With the assumption of a probability distribution, less training data is needed
    • With the assumption of a probability distribution, the model is more robust to noise
    • Priors and class-dependent probabilities can be estimated from different sources.

Multi-class Classification

We use 3 classes as an example:
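A common extension to 3 classes uses the softmax function, which turns per-class scores $z_k = w^k \cdot x + b_k$ into probabilities. A minimal sketch, assuming this standard softmax formulation (`W`, `b`, and the function names are illustrative):

```python
import numpy as np

def softmax(z):
    """Turn class scores z = (z_1, z_2, z_3) into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

def predict_class(x, W, b):
    """W: (3, d) matrix with one weight vector per class; b: (3,) biases."""
    z = W @ x + b              # z_k = w^k . x + b_k
    return int(np.argmax(softmax(z)))
```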

Limitation of Logistic Regression

Given the 4 feature points above, logistic regression cannot classify them effectively: no linear decision boundary can perfectly separate the red points from the blue points. There are two ways to address this:

  • Feature Transformation: not always easy to find a good transformation (a sketch follows this list)
  • Cascading logistic regression models
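
A sketch of the first approach on XOR-style data, assuming a transformation that maps each point to its distances from $(0, 0)$ and $(1, 1)$ (an illustrative choice; after the transform a linear boundary separates the classes):

```python
import numpy as np

# XOR-like data: no single line separates the two classes in (x1, x2).
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([1, 1, 0, 0])  # (0,0) and (1,1) in class 1

def transform(x):
    """Map x to (distance to (0,0), distance to (1,1))."""
    return np.array([np.linalg.norm(x - np.array([0.0, 0.0])),
                     np.linalg.norm(x - np.array([1.0, 1.0]))])

X_new = np.array([transform(x) for x in X])
# Class 1 points become (0, sqrt(2)) and (sqrt(2), 0); class 0 points both
# become (1, 1) -- now separable by a line such as x1' + x2' = 1.7.
print(X_new)
```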